WO2019023921A1 - Gesture recognition method, apparatus, and device - Google Patents

Gesture recognition method, apparatus, and device

Info

Publication number
WO2019023921A1
WO2019023921A1 (PCT/CN2017/095388; CN2017095388W)
Authority
WO
WIPO (PCT)
Prior art keywords
image
gesture recognition
images
recognition result
video segment
Prior art date
Application number
PCT/CN2017/095388
Other languages
English (en)
French (fr)
Inventor
王亮
许松岑
刘传建
何俊
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/CN2017/095388 (WO2019023921A1)
Priority to EP17920578.6A (EP3651055A4)
Priority to KR1020207005925A (KR102364993B1)
Priority to CN201780093539.8A (CN110959160A)
Priority to BR112020001729A (BR112020001729A8)
Publication of WO2019023921A1
Priority to US16/776,282 (US11450146B2)

Classifications

    • G: PHYSICS
      • G06: COMPUTING; CALCULATING OR COUNTING
        • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
          • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
            • G06V 40/20: Movements or behaviour, e.g. gesture recognition
          • G06V 10/00: Arrangements for image or video recognition or understanding
            • G06V 10/40: Extraction of image or video features
              • G06V 10/56: Extraction of image or video features relating to colour
            • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
              • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
                • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
                  • G06V 10/809: Fusion of classification results, e.g. where the classifiers operate on the same input data
        • G06F: ELECTRIC DIGITAL DATA PROCESSING
          • G06F 18/00: Pattern recognition
            • G06F 18/20: Analysing
              • G06F 18/25: Fusion techniques
                • G06F 18/253: Fusion techniques of extracted features
                • G06F 18/254: Fusion techniques of classification results, e.g. of results related to same input data
          • G06F 3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
            • G06F 3/01: Input arrangements or combined input and output arrangements for interaction between user and computer
              • G06F 3/011: Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
              • G06F 3/017: Gesture based interaction, e.g. based on a set of recognized hand gestures
        • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 20/00: Machine learning
            • G06N 20/10: Machine learning using kernel methods, e.g. support vector machines [SVM]
            • G06N 20/20: Ensemble learning
          • G06N 3/00: Computing arrangements based on biological models
            • G06N 3/02: Neural networks
            • G06N 3/08: Learning methods

Definitions

  • The present application relates to the field of human-computer interaction technologies, and in particular, to a gesture recognition method, apparatus, and device.
  • Gesture input is an indispensable key technology for realizing natural and direct human-computer interaction.
  • Computer-vision-based gesture recognition has become a research hotspot because it does not depend on dedicated equipment and offers more natural human-computer interaction and better immersion.
  • In the related art, the computer-vision-based gesture recognition scheme is as follows: first, a video stream of gesture images is captured by a camera and converted into image frames; then the shape, feature, and position information of the gesture is extracted from the image frames by segmenting and tracking the gesture according to a specific image tracking algorithm; finally, the gesture is recognized according to the extracted shape, feature, and position information, combined with pre-established classification criteria.
  • When the shape, feature, and position information of the gesture is extracted from the image frames, the gesture in the image needs to be segmented and tracked, and the segmentation and tracking process consumes considerable processing time, so the delay is too large.
  • To reduce the delay of gesture recognition, the embodiments of the present application provide a gesture recognition method, apparatus, and device.
  • In a first aspect, a gesture recognition method is provided, comprising: acquiring M images, the M images being extracted from a first video segment in a video stream, where the first video segment is any video segment in the video stream and M is an integer greater than or equal to 2; performing gesture recognition on the M images by a deep learning algorithm to obtain a gesture recognition result corresponding to the first video segment; and after gesture recognition results of N consecutive video segments in the video stream, including the first video segment, are obtained, fusing the gesture recognition results of the N consecutive video segments to obtain a fused gesture recognition result, where N ≥ 2 and N is an integer.
  • In the above gesture recognition method, for each video segment in the video stream, M images in the video segment are acquired and gesture recognition is performed on them by a deep learning algorithm to obtain the gesture recognition result corresponding to that video segment; finally, the gesture recognition results of the N consecutive video segments including that video segment are fused to obtain a gesture recognition result for the N consecutive segments. In other words, in the above recognition process, the gesture in the video stream does not need to be segmented and tracked; instead, a computationally fast deep learning algorithm recognizes the action of each stage, and the stage actions are then fused, which increases the speed of gesture recognition and reduces its delay.
  • In a possible implementation, fusing the gesture recognition results of the N consecutive video segments to obtain the fused gesture recognition result includes: inputting the gesture recognition results of the N consecutive video segments into a pre-trained first machine learning model to obtain the fused gesture recognition result, where the first machine learning model is configured to determine the overall gesture motion trend formed by the N consecutive input gesture recognition results and to output the gesture corresponding to that overall trend as the fused gesture recognition result.
  • In practical applications, when a user performs a gesture operation, the user may, for a short time during the operation, make a gesture action that does not conform to the current gesture operation. With the above possible implementation, after the gesture recognition result of each video segment is obtained, the final gesture recognition result can be obtained from the gesture motion trend indicated by the recognition results of multiple consecutive video segments, which eliminates the influence of the user's brief erroneous gesture on the final result and thereby improves the accuracy of gesture recognition.
  • In a possible implementation, the first machine learning model is a neural network model, and the number of neurons of the neural network model is N; or, the first machine learning model is a support vector machine (SVM) model.
  • In a possible implementation, fusing the gesture recognition results of the N consecutive video segments to obtain the fused gesture recognition result includes: obtaining preset weight coefficients corresponding to the gesture recognition results of the N consecutive video segments, and performing a weighted average of the gesture recognition results of the N consecutive video segments according to those weight coefficients to obtain the fused gesture recognition result.
  • With this implementation, after the gesture recognition results of the individual video segments are obtained, the results of multiple consecutive video segments can be weighted and averaged according to the preset weights, which weakens the influence of the user's brief erroneous gestures on the final gesture recognition result and thereby improves the accuracy of gesture recognition.
  • In a possible implementation, performing gesture recognition on the M images by the deep learning algorithm to obtain the gesture recognition result corresponding to the first video segment includes: performing image processing on the M images to obtain an optical flow information image corresponding to the first video segment, where the optical flow information image contains the optical flow information between a first image of the M images and the p-th image before the first image, the first image is any one of the M images, the optical flow information contains the instantaneous velocity vector information of the pixels in the image, and p is an integer greater than or equal to 1; performing gesture recognition on the optical flow information image by a first deep learning algorithm to obtain a first recognition result; performing image processing on the M images to obtain a color information image corresponding to the first video segment, where the color information image contains the color information of the M images and the color information contains the color values of the pixels in the image; performing gesture recognition on the color information image by a second deep learning algorithm to obtain a second recognition result; and fusing the first recognition result and the second recognition result to obtain the gesture recognition result of the first video segment.
  • The above possible implementation extracts the optical flow information and the color information of the video segment from the M images, performs gesture recognition separately on the extracted optical flow information and color information, and then fuses the separately obtained recognition results, which mitigates the inaccuracy of gestures recognized by a single deep learning algorithm and improves the accuracy of the gesture recognition result for the video segment.
  • In a possible implementation, performing image processing on the M images to obtain the optical flow information image corresponding to the first video segment includes:
  • for the first image, acquiring the p-th image before the first image in the video stream according to a preset rule, calculating the optical flow information between the first image and the p-th image, and generating an optical flow information image containing the optical flow information between the first image and the p-th image, where the time interval between the first image and the p-th image is not less than the forward computation time of the first deep learning algorithm and the time required to calculate the optical flow information image;
  • or, for the first image, acquiring all p images before the first image in the video stream according to a preset rule, calculating the optical flow information between each pair of adjacent images among the first image and the p images, accumulating the optical flow information between the adjacent image pairs, and generating an optical flow information image containing the accumulated optical flow information, where the time interval between the first image and the p-th image before the first image is not less than the forward computation time of the first deep learning algorithm and the time required to calculate the optical flow information image.
  • With the above possible implementation, the optical flow information image between the current image and the p-th image before it can be obtained, so that the subsequent deep learning algorithm can perform gesture recognition on the optical flow information image directly, without segmenting and tracking the gesture in the image, which simplifies the processing of gesture recognition, increases its speed, and reduces its delay.
  • In a possible implementation, performing image processing on the M images to obtain the color information image corresponding to the first video segment includes:
  • extracting the color information of m images and generating the corresponding color information images, where the m images are m images selected at random from the M images, or the m images among the M images that change the most with respect to their respective previous images in the video stream, and m is an integer greater than or equal to 1;
  • or, comparing the pixels at the same positions in the M images to identify the pixel positions at which the image content changes over time, averaging the color information at the identified pixel positions to obtain new color information, and generating, according to the new color information at the identified pixel positions, a color information image corresponding to the first video segment.
  • In a possible implementation, before acquiring the M images, the method further includes: determining, for the first video segment, a time window of a preset time length in the video stream whose end time falls within the time period corresponding to the first video segment, and determining, according to the last image of the video stream in the time window and at least one reference image in the time window, whether an action occurs in the first video segment; the M images are acquired when it is determined that an action occurs.
  • Since a gesture operation necessarily involves a gesture action, the above possible implementation first determines, by using an image in the video segment and at least one image before that image, whether an action occurs in the video segment before performing gesture recognition on it, and performs the subsequent recognition operation only when it is determined that an action occurs, thereby reducing unnecessary recognition steps, saving computational resources, and improving recognition efficiency.
  • In a possible implementation, determining, according to the last image in the time window and the at least one reference image, whether an action occurs in the first video segment includes: calculating, for each reference image, a partial-derivative image of the last image; normalizing and binarizing the partial-derivative image; and determining that an action occurs in the first video segment when the sum of the gray values of the binarized image is greater than 0.
  • In a possible implementation, fusing the first recognition result and the second recognition result to obtain the gesture recognition result of the first video segment includes: performing an average calculation on the first recognition result and the second recognition result and obtaining the gesture recognition result of the first video segment according to the calculated average; or inputting the first recognition result and the second recognition result into a pre-trained second machine learning model to obtain the gesture recognition result of the first video segment.
  • In a second aspect, a gesture recognition apparatus is provided, the apparatus having the functionality to implement the gesture recognition method provided by the first aspect and the possible implementations of the first aspect.
  • The functions may be implemented by hardware, or by corresponding software executed by hardware.
  • The hardware or software includes one or more units corresponding to the functions described above.
  • In a third aspect, a gesture recognition apparatus is provided, comprising a processor and a memory; the processor implements the gesture recognition method provided by the first aspect and the possible implementations of the first aspect by executing programs or instructions stored in the memory.
  • In a fourth aspect, a computer-readable storage medium is provided, the storage medium storing an executable program that is executed by a processor to implement the gesture recognition method provided by the first aspect and the possible implementations of the first aspect.
  • FIG. 1 is an architectural diagram of a gesture recognition system according to the present application;
  • FIG. 2 is a schematic diagram of gesture recognition according to the embodiment shown in FIG. 1;
  • FIG. 3 is a flowchart of a gesture recognition method provided by an exemplary embodiment of the present application;
  • FIG. 4 is a schematic diagram of two time window spans involved in the embodiment shown in FIG. 3;
  • FIG. 5 is a schematic diagram of fusion of recognition results according to the embodiment shown in FIG. 3;
  • FIG. 6 is a schematic flowchart of gesture recognition involved in the embodiment shown in FIG. 3;
  • FIG. 7 is a schematic structural diagram of a gesture recognition apparatus according to an exemplary embodiment of the present application;
  • FIG. 8 is a structural block diagram of a gesture recognition apparatus according to an exemplary embodiment of the present application.
  • FIG. 1 is a system architecture diagram of a gesture recognition system according to an embodiment of the present application.
  • The gesture recognition system may include the following devices: an image capture device 110 and a gesture recognition device 120.
  • The image capture device 110 may be a camera.
  • Specifically, the image capture device 110 may be a single camera, or the image capture device 110 may be a camera module composed of two or more cameras.
  • Optionally, the image capture device 110 may be fixedly disposed, or the image capture device 110 may be integrated with a servo motor; the servo motor may drive the image capture device 110 to rotate or move under the control of the gesture recognition device, so as to change the shooting angle or shooting position of the image capture device 110.
  • The gesture recognition device 120 may be a general-purpose computer, or the gesture recognition device may be an embedded computing device.
  • The image capture device 110 and the gesture recognition device 120 may be independent devices, in which case the image capture device 110 and the gesture recognition device 120 are connected by a wired or wireless network.
  • Alternatively, the image capture device 110 and the gesture recognition device 120 may be integrated in the same physical device, in which case the image capture device 110 and the gesture recognition device 120 are connected by a communication bus.
  • Optionally, the gesture recognition device 120 transmits the recognized gesture to a control device 130, and the control device 130 determines a corresponding control instruction according to the recognized gesture and performs the corresponding control operation according to the determined control instruction, for example, controlling a graphic display according to the control instruction, or controlling a controlled device to perform an operation according to the control instruction, and the like.
  • In an application, the image capture device 110 transmits the collected video stream to the gesture recognition device 120, and the gesture recognition device 120 performs image analysis and gesture recognition on the video stream to recognize the gestures in the video stream in real time.
  • FIG. 2 illustrates a schematic diagram of gesture recognition according to an embodiment of the present application.
  • Taking a video segment as a unit, the gesture recognition device 120 may extract M images (where M is an integer greater than or equal to 2) from a video segment of the video stream and perform gesture recognition on the M images by a deep learning algorithm to obtain the gesture recognition result corresponding to the video segment; after obtaining the gesture recognition results of N consecutive video segments of the video stream including that video segment, the gesture recognition device fuses the gesture recognition results of the N consecutive segments to obtain the fused gesture recognition result.
  • That is, a complete gesture action is divided into multiple stage actions, each stage action is recognized by a deep learning algorithm, and the recognized stage actions are finally merged into a complete gesture action.
  • In the above recognition process, it is not necessary to segment and track the gestures in the video stream; instead, a computationally fast deep learning algorithm recognizes the action of each stage, thereby increasing the speed of gesture recognition and reducing its delay.
  • Optionally, the above deep learning algorithm is a two-channel deep learning algorithm based on optical flow information and color information.
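As an illustration of the segment-by-segment flow described above, the following is a minimal Python sketch. The function names (recognize_segment, fuse_results) and the helper logic are hypothetical placeholders for the per-segment recognition (steps 304 to 307 below) and the result fusion (step 308), not an actual API of this application.

```python
from collections import deque

def gesture_recognition_stream(video_segments, recognize_segment, fuse_results, N=5, M=8):
    """Per-segment recognition followed by fusion over N consecutive segments.

    `video_segments` yields lists of frames; `recognize_segment` and `fuse_results`
    stand in for the deep-learning recognition and the result fusion described here.
    """
    recent_results = deque(maxlen=N)                         # staged per-segment results
    for segment in video_segments:
        images = segment[::max(1, len(segment) // M)][:M]    # extract M images from the segment
        recent_results.append(recognize_segment(images))     # staged gesture recognition result
        if len(recent_results) == N:                         # N consecutive segments available
            yield fuse_results(list(recent_results))         # fused gesture recognition result
```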
  • FIG. 3 is a flowchart of the gesture recognition method provided by an exemplary embodiment of the present application. As shown in FIG. 3, the gesture recognition method may include the following steps:
  • Step 301: For a first video segment in the video stream, determine a time window of a preset time length in the video stream, where the end time of the time window is within the time period corresponding to the first video segment.
  • The first video segment is any video segment in the video stream.
  • In the embodiment of the present application, the gesture recognition device may divide the video stream into a plurality of video segments connected end to end and perform gesture recognition separately for each video segment.
  • The video stream is composed of a series of video images corresponding to different time points.
  • The time window may be a time window between the time points corresponding to two different video images; that is, the length of time between the time point corresponding to the first image of the video stream in the time window and the time point corresponding to the last image in the time window is the preset time length.
  • The last image in the time window is an image in the first video segment to be recognized, and the other images of the video stream in the time window may be images in the first video segment, or may be images in the video stream that precede the first video segment.
  • FIG. 4 illustrates the two time window spans involved in the embodiment of the present application.
  • The start time of the time window is t1 and its end time is t2; the start time of the first video segment is t3 and its end time is t4.
  • In the first span, both t1 and t2 are between t3 and t4, that is, the time window is completely within the first video segment.
  • In the second span, t2 is between t3 and t4 while t1 is before t3, that is, part of the time window is within the first video segment and the other part is before the first video segment.
  • The above preset time length may be preset in the gesture recognition device by the developer.
  • Step 302: Determine, according to the last image in the time window and at least one reference image, whether an action occurs in the first video segment; if yes, go to step 303; otherwise, return to step 301 to determine the time window of the next preset time length.
  • The at least one reference image is any image in the time window other than the last image.
  • In the embodiment of the present application, the gesture recognition device determines whether an action occurs in the first video segment according to the difference between the last image of the video stream in the time window and at least one other image of the video stream in the time window.
  • The step of determining whether an action occurs in the first video segment according to the last image in the time window and the other at least one image in the time window may be divided into the following sub-steps:
  • Step 302a: For each reference image of the at least one reference image, calculate a partial-derivative image of the last image, where the value of each pixel in the partial-derivative image is the partial derivative of the value of the corresponding pixel in the last image with respect to the value of the corresponding pixel in the reference image.
  • For example, suppose the image of one frame at time t0 is f(x, y, t0), where the image at time t0 is the last image in the above time window, and the image q time points earlier is f(x, y, t0 − q); the gesture recognition device calculates the partial derivative of the video stream with respect to time t at time t0 relative to time t0 − q:
  • Step 302b: Normalize the values of the pixels in the partial-derivative image to obtain a normalized partial-derivative image.
  • Step 302c: Binarize the normalized partial-derivative image according to a preset binarization threshold to obtain a binarized image of the partial-derivative image, where the value of each pixel in the binarized image is 0 or 1.
  • That is, the normalized partial-derivative image is binarized, and the value of each pixel in the normalized partial-derivative image is binarized to 0 or 1; the binarization formula is as follows:
  • where Z is the preset binarization threshold; for a pixel in the normalized partial-derivative image g_b(x, y, t0), when the value of the pixel is greater than Z, the value of the pixel is binarized to 1, and when the value of the pixel is less than or equal to Z, the value of the pixel is binarized to 0.
  • The preset binarization threshold is a preset value between 0 and 1; for example, the preset binarization threshold may be 0.5, 0.4, or 0.6, and so on. The binarization threshold can be preset by the developer based on the actual processing effect.
  • Step 302d: Calculate the sum of the gray values of the pixels in the binarized image.
  • Step 302e: When the sum of the gray values is greater than 0, determine that an action occurs in the first video segment.
  • Specifically, the gesture recognition device calculates the sum Sum(t0) of the gray values of the binarized image S_b(x, y, t0); when Sum(t0) is greater than 0, it determines that an action occurs in the first video segment; otherwise, it is considered that no action occurs in the first video segment. The formula is as follows:
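The following sketch illustrates one plausible reading of steps 302a through 302e, assuming the temporal partial derivative is approximated by a per-pixel frame difference; the exact formulas are those referenced above, and NumPy is used purely for convenience.

```python
import numpy as np

def action_occurs(last_image, reference_image, z=0.5):
    """Steps 302a-302e: partial-derivative image, normalization, binarization with
    threshold Z, and the gray-value sum test (frame differencing assumed)."""
    last = last_image.astype(np.float32)
    ref = reference_image.astype(np.float32)
    partial = np.abs(last - ref)                                  # step 302a: partial-derivative image
    max_val = partial.max()
    normalized = partial / max_val if max_val > 0 else partial    # step 302b: normalize to [0, 1]
    binary = (normalized > z).astype(np.uint8)                    # step 302c: binarize with threshold Z
    return binary.sum() > 0                                       # steps 302d-302e: action if sum > 0
```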
  • Step 303: Acquire M images, where the M images are extracted from the first video segment.
  • In the embodiment of the present application, the gesture recognition device may extract M images from the first video segment, where M is an integer greater than or equal to 2.
  • Optionally, the gesture recognition device may extract every image in the first video segment to obtain the M images.
  • Alternatively, the gesture recognition device may extract one image every one or more images in the first video segment to obtain the M images.
  • Step 304: Perform image processing on the M images to obtain an optical flow information image corresponding to the first video segment.
  • The optical flow information image contains the optical flow information between a first image of the M images and the p-th image before the first image, where the first image is any one of the M images, the optical flow information contains the instantaneous velocity vector information of the pixels in the image, and p is an integer greater than or equal to 1.
  • Optical flow is the instantaneous velocity of the pixel motion of a spatially moving object on the observation imaging plane.
  • The gesture recognition device can use the changes of the pixels of the image sequence in the time domain and the correlation between adjacent frames to find the correspondence between the previous image and the current image, and thereby calculate the motion information of objects between the two images; the calculated motion information of objects between the two images is the optical flow information between the two images.
  • The above method of calculating the motion information of objects between two successive images is referred to as an optical flow method.
  • The optical flow information, also called the optical flow field, refers to the apparent motion of the image gray-scale pattern; it is a two-dimensional vector field, and the information it contains is the instantaneous motion velocity vector of each image point. Therefore, the optical flow information can be represented as a two-channel image of the same size as the original image.
  • In the embodiment of the present application, the gesture recognition device may use the RGB image sequence within the first video segment to obtain an optical flow information image (regardless of how many frames the first video segment contains).
  • In a possible implementation, the optical flow information image corresponding to the first video segment may be obtained in the following manners:
  • Mode one: for the first image of the M images, acquire the p-th image before the first image in the video stream according to a preset rule, calculate the optical flow information between the first image and the p-th image, and generate an optical flow information image containing the optical flow information between the first image and the p-th image.
  • The time interval between the first image and the p-th image is not less than the forward computation time of the first deep learning algorithm and the time required to calculate the optical flow information image.
  • The first deep learning algorithm is the algorithm used by the gesture recognition device to recognize the gesture from the optical flow information image.
  • The preset rule may be a rule set by the developer or the user; for example, the developer or the user may manually set the value of p. Alternatively, the gesture recognition device may set the value of p according to a preset rule based on the processing performance of the device; for example, the gesture recognition device may run the forward computation of the first deep learning algorithm and the computation of the optical flow information image in advance, determine the number of images in the video stream corresponding to the larger of the forward computation time and the time for computing the optical flow information image, and set the value of p to the value corresponding to the determined number of images.
  • For example, assume T is the time interval between an image and the p-th image before that image; the minimum value of T may be the larger of the time required for the forward computation of the deep learning network used by the gesture recognition device to perform gesture recognition through the optical flow information image and the time required for the gesture recognition device to calculate the optical flow information image.
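A minimal sketch of how p might be derived from the two measured processing times, assuming a fixed frame rate; the function and variable names here are illustrative and are not defined in this application.

```python
import math

def choose_p(forward_time_s, optical_flow_time_s, frame_rate_hz=30):
    """p is set so that the interval between an image and the p-th image before it
    is not less than the larger of the two measured processing times."""
    t_min = max(forward_time_s, optical_flow_time_s)      # minimum required interval T
    return max(1, math.ceil(t_min * frame_rate_hz))       # number of frames covering T

# Example: a 40 ms forward pass and a 25 ms optical-flow computation at 30 fps give p = 2.
p = choose_p(0.040, 0.025)
```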
  • In mode one, the gesture recognition device can use an Eulerian motion field algorithm to directly calculate, from an image I_t(x, y) of the M images and the p-th image I_{t-T}(x, y) before I_t(x, y), the optical flow information corresponding to that image, generate an optical flow information image containing the calculated optical flow information, and use that optical flow information image as the optical flow information image corresponding to the first video segment. Its calculation formula can be simply expressed as U_t(x, y) = OF[I_{t-T}(x, y), I_t(x, y)],
  • where U_t(x, y) is the optical flow information image corresponding to the image I_t(x, y),
  • and OF[·] represents the above Eulerian optical flow field algorithm.
  • Mode two: for the first image of the M images, acquire all p images before the first image in the video stream according to a preset rule; calculate the optical flow information between every two adjacent images among the first image and the p images;
  • accumulate the optical flow information between every two adjacent images, and generate an optical flow information image containing the accumulated optical flow information.
  • In mode two, the gesture recognition device can use a Lagrangian motion field algorithm to calculate the optical flow information between every two adjacent images among an image I_t(x, y) of the M images and the p images I_{t-1}(x, y), I_{t-2}(x, y), ..., I_{t-T}(x, y) before I_t(x, y), then accumulate the optical flow information between every two adjacent images and generate an image U_t(x, y) containing the accumulated optical flow information.
  • The process of accumulating the optical flow information involves interpolation of missing data; interpolation methods such as linear, bilinear, and cubic interpolation can be selected.
  • Here, U_t(x, y) is the optical flow information image corresponding to the image I_t(x, y),
  • and OF[·] represents the above Lagrangian optical flow field algorithm.
  • In mode one, only one optical flow field needs to be calculated, so a more accurate optical flow field algorithm should be selected; in mode two, multiple optical flow fields need to be calculated, which allows the use of a less accurate but faster optical flow field algorithm.
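The sketch below shows both manners of building the optical flow information image. The Eulerian and Lagrangian motion-field algorithms themselves are not specified in this application, so OpenCV's Farneback dense optical flow is used here purely as a stand-in for OF[·]; simple summation without the missing-data interpolation step is a further simplification.

```python
import cv2
import numpy as np

def flow_direct(img_t_minus_p, img_t):
    """Mode one: one optical flow field between I_{t-T} and I_t (stand-in for an
    Eulerian motion-field algorithm)."""
    g0 = cv2.cvtColor(img_t_minus_p, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(img_t, cv2.COLOR_BGR2GRAY)
    # Returns a two-channel image U_t(x, y): horizontal and vertical velocity components.
    return cv2.calcOpticalFlowFarneback(g0, g1, None, 0.5, 3, 15, 3, 5, 1.2, 0)

def flow_accumulated(images):
    """Mode two: accumulate the optical flow between every two adjacent images
    (stand-in for a Lagrangian motion-field algorithm, without interpolation)."""
    grays = [cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) for img in images]
    acc = np.zeros(grays[0].shape + (2,), dtype=np.float32)
    for prev, nxt in zip(grays[:-1], grays[1:]):
        acc += cv2.calcOpticalFlowFarneback(prev, nxt, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    return acc
```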
  • Step 305: Perform image processing on the M images to obtain a color information image corresponding to the first video segment.
  • The color information image contains the color information of the M images.
  • The color information contains the color values of the pixels in the image.
  • In the embodiment of the present application, the gesture recognition device processes the image sequence in the first video segment and outputs m color information images, such as RGB (red green blue) images, to represent the color information image corresponding to the first video segment, where m is an integer greater than or equal to 1.
  • Optionally, the gesture recognition device can obtain the color information image corresponding to the first video segment by performing image processing on the M images in the following manners:
  • In one manner, the gesture recognition device extracts the color information of each of m images and generates the color information image corresponding to that image, where the generated color information image contains the color information of that image.
  • The m images may be m images selected at random from the M images.
  • For example, the color information image corresponding to an image randomly selected from the first video segment may be used as the representative, which is:
  • where t − T is the time point corresponding to the first image in the first video segment,
  • and t is the time point corresponding to the last image in the first video segment.
  • Optionally, the gesture recognition device may also select the color information images of m images as the color information image of the first video segment by other strategies; for example, the gesture recognition device may use, as the color information image corresponding to the first video segment, the color information images of the m images whose corresponding times are the earliest or the latest among the M images.
  • In another manner, the m images may be the m images among the M images that change the most with respect to their respective previous images in the video stream.
  • For example, the gesture recognition device can detect the pixels in an image that change compared with the image before it in the video stream, and acquire, as the color information image corresponding to the first video segment, the color information images corresponding to the m images of the M images that have the largest number of pixels changing with respect to their respective previous images.
  • Alternatively, the gesture recognition device may compare the pixels at the same positions in the M images to identify the pixel positions (the pixel position may be the coordinates of the pixel in the image) at which the image content changes over time, average the color information of the pixels at the identified pixel positions in the M images to obtain new color information corresponding to the identified pixel positions, and generate a new color information image, where the color information at the identified pixel positions in the new color information image is the new color information obtained by the above averaging.
  • The algorithm for detecting the pixels that change compared with the previous image and the algorithm for detecting the pixel positions in an image that change over time may be collectively referred to as spatio-temporal saliency image detection algorithms.
  • In yet another manner, the average value of the color information of each image in the first video segment is taken as the color information image of the first video segment.
  • The calculation formula can be as follows:
  • where t − T is the time point corresponding to the first image in the first video segment,
  • t is the time point corresponding to the last image in the first video segment,
  • and n is the number of images in the first video segment.
  • In the images of the first video segment, the pixels that change are usually the foreground portion of the image (that is, the portion corresponding to the human hand), while the pixels corresponding to the background portion are generally unchanged. Therefore, in the color information images corresponding to all or some of the above images, the color information of the pixels corresponding to the background portion is generally the same as or close to the average color information at those positions, while the color information of the pixels corresponding to the foreground portion generally differs from the average color information at those positions. Therefore, in the embodiment of the present application, the average value at the corresponding pixel position may be subtracted from the color information of each pixel in the color information image corresponding to each of all or some of the images, so as to obtain the color information images of all or some of the images with the background portion removed, and the gesture recognition device may use the background-removed color information images of all or some of the images as the color information image corresponding to the first video segment.
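A minimal sketch of the last two manners described above: averaging the color information of the images of the segment, and subtracting that average to remove the largely static background. This is one plausible reading of the description; the spatio-temporal saliency detection algorithms mentioned above are not reproduced here.

```python
import numpy as np

def mean_color_image(images):
    """Per-pixel average of the color information of the n images of the segment."""
    stack = np.stack([img.astype(np.float32) for img in images])   # shape (n, H, W, 3)
    return stack.mean(axis=0)

def background_removed_images(images):
    """Subtract the average color at each pixel position, so pixels of the static
    background fall close to zero while the moving hand (foreground) remains."""
    mean_img = mean_color_image(images)
    return [np.clip(img.astype(np.float32) - mean_img, 0, 255).astype(np.uint8)
            for img in images]
```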
  • Step 306: Perform gesture recognition on the optical flow information image by a first deep learning algorithm to obtain a first recognition result, and perform gesture recognition on the color information image by a second deep learning algorithm to obtain a second recognition result.
  • That is, a color information image (for example, an RGB image) and an optical flow information image can be obtained in the preceding steps from the input video stream; in this step 306, two deep learning models are used for gesture recognition respectively, and the results recognized by the two deep learning models are fused in the next step.
  • The embodiment of the present application uses a two-channel deep learning model for gesture recognition. One channel is a temporal stream (corresponding to the first deep learning algorithm described above); its input is the optical flow information image, and it outputs the gesture recognition result for the current optical flow information image. For example, in step 304 above, after the optical flow information image of each of the M images is acquired, the optical flow information images are buffered, and the gesture recognition device inputs the X most recently stored optical flow information images into the temporal-stream deep learning channel, which outputs the gesture recognition result corresponding to those X optical flow information images.
  • The other channel is a spatial stream (corresponding to the second deep learning algorithm described above); its input is the at least one color information image, obtained in step 305, that represents the first video segment, and its output is the gesture recognition result for the at least one color information image.
  • The above two-channel deep learning model is a pre-trained machine learning model.
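The network architecture of the temporal and spatial streams is not specified in this application, so the following PyTorch sketch only illustrates the two-channel idea: a temporal stream taking a stack of X two-channel optical flow images and a spatial stream taking an RGB color information image, each producing per-class gesture scores. The layer sizes and the class count are placeholders.

```python
import torch
import torch.nn as nn

class StreamCNN(nn.Module):
    """A small per-stream backbone; in_channels is 2*X for the temporal stream
    (X stacked optical flow images) and 3 for the spatial stream (an RGB image)."""
    def __init__(self, in_channels, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):
        f = self.features(x).flatten(1)
        return self.classifier(f)            # per-class gesture scores for this stream

# Temporal stream: X = 5 stacked two-channel optical flow images -> 10 input channels.
temporal_stream = StreamCNN(in_channels=2 * 5)
# Spatial stream: one RGB color information image -> 3 input channels.
spatial_stream = StreamCNN(in_channels=3)
```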
  • Step 307: Fuse the first recognition result and the second recognition result to obtain the gesture recognition result of the first video segment.
  • The gesture recognition result obtained from the optical flow information image and the gesture recognition result obtained from the color information image are both gesture recognition results of the same video segment; therefore, after obtaining the gesture recognition result corresponding to the optical flow information image and the gesture recognition result corresponding to the color information image, the gesture recognition device may fuse the two results to obtain the gesture recognition result of the first video segment.
  • There may be two fusion manners: one is to perform an average calculation on the first recognition result and the second recognition result and obtain the gesture recognition result of the first video segment according to the calculated average;
  • the other is to input the first recognition result and the second recognition result into a pre-trained second machine learning model, such as a support vector machine (SVM) model, to obtain the gesture recognition result of the first video segment.
  • The second machine learning model is a learning model for determining a single recognition result from the two input recognition results, and the second machine learning model can be obtained in advance by training on video segments containing gestures.
  • For example, the two recognition results may be two values; the gesture recognition device may input the two values into the second machine learning model, and the second machine learning model calculates a fused value according to the pre-trained calculation formula and the two input values, and outputs the fused value as the gesture recognition result of the first video segment.
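A sketch of the two fusion manners for the first and second recognition results: averaging the per-class scores, or feeding the concatenated scores to a pre-trained SVM. scikit-learn's SVC is used here only as an illustrative stand-in for the second machine learning model, and the score arrays are assumed to have shape (n_samples, num_classes).

```python
import numpy as np
from sklearn.svm import SVC

def fuse_by_average(temporal_scores, spatial_scores):
    """Manner 1: average the two recognition results and take the best class."""
    avg = (np.asarray(temporal_scores) + np.asarray(spatial_scores)) / 2.0
    return np.argmax(avg, axis=1)

def train_fusion_svm(temporal_scores, spatial_scores, labels):
    """Manner 2: train the second machine learning model on concatenated stream scores."""
    features = np.concatenate([temporal_scores, spatial_scores], axis=1)
    return SVC(kernel="rbf").fit(features, labels)

def fuse_by_svm(svm, temporal_scores, spatial_scores):
    features = np.concatenate([temporal_scores, spatial_scores], axis=1)
    return svm.predict(features)             # gesture recognition result of the video segment
```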
  • For each video segment, the gesture recognition device obtains the staged gesture recognition result corresponding to the video segment in real time and stores it in a temporary gesture recognition result library.
  • Step 308: After obtaining the gesture recognition results of N consecutive video segments in the video stream including the first video segment, fuse the gesture recognition results of the N consecutive video segments to obtain the fused gesture recognition result.
  • Here, N ≥ 2 and N is an integer.
  • In a possible implementation, the gesture recognition device may input the gesture recognition results of the N consecutive video segments into a pre-trained first machine learning model to obtain the fused gesture recognition result,
  • where the first machine learning model is configured to determine the overall gesture motion trend formed by the N consecutive input gesture recognition results and to output the gesture corresponding to that overall gesture motion trend as the fused gesture recognition result.
  • For example, the N consecutive gesture recognition results may be N values;
  • the gesture recognition device may input the N values into the first machine learning model in the time order of the N video segments, and the first machine learning model calculates a fused value according to the pre-trained calculation formula and the N successively input values, and outputs the fused value as the fused gesture recognition result.
  • Optionally, the first machine learning model is a neural network model, and the number of neurons of the neural network model is N; or, the first machine learning model is a support vector machine (SVM) model.
  • In another possible implementation, the gesture recognition device may obtain the preset weight coefficients corresponding to the gesture recognition results of the N consecutive video segments, and perform a weighted average of the gesture recognition results of the N consecutive video segments according to the weight coefficients corresponding to the gesture recognition results of the N consecutive video segments, to obtain the fused gesture recognition result.
  • In general, when a user performs a complete gesture operation, the overall gesture movement trend is consistent with the gesture action the user wants to make, but there may be a short period in which the gesture does not conform to the gesture action the user wants to make.
  • Taking an upward hand-raising gesture as an example, the user makes the gesture of raising the hand within 1 s, but within a short period (such as 0.2 s) of that 1 s, the user does not raise the hand upward but presses the hand slightly downward, and after that short period the user continues to raise the hand.
  • For that short period, the gesture recognition result recognized by the gesture recognition device does not match the gesture operation the user currently wants to perform.
  • In the embodiment of the present application, the gesture recognition device may combine the gesture recognition results of multiple consecutive video segments (that is, a sequence of gesture recognition results) and take the overall gesture motion trend reflected by the gesture recognition results of the multiple video segments as the fused gesture recognition result.
  • Specifically, assuming that the length of each video segment is T1, the gesture recognition device calculates the N staged action recognition results and uses the fusion of the N staged recognition results to decide the final recognition result; N*T1 can be about 1 second.
  • Assume the staged recognition results are r1, r2, ..., rN, and the weight coefficients before the respective results are α1, α2, ..., αN; these weight coefficients may be coefficients determined in advance by a machine learning algorithm, and different combinations of coefficients will yield different fusion effects.
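A minimal sketch of the weighted fusion of the N staged results r1, ..., rN with the pre-determined coefficients α1, ..., αN, assuming each staged result is a vector of per-class scores; the normalization by the sum of the weights is an assumption for illustration.

```python
import numpy as np

def fuse_staged_results(staged_results, weights):
    """Weighted average of the N staged recognition results r_1..r_N with the
    pre-determined coefficients alpha_1..alpha_N; returns the fused gesture class."""
    r = np.asarray(staged_results, dtype=np.float32)      # shape (N, num_classes)
    alpha = np.asarray(weights, dtype=np.float32)         # shape (N,)
    fused = (alpha[:, None] * r).sum(axis=0) / alpha.sum()
    return int(np.argmax(fused))
```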
  • FIG. 5 is a schematic diagram of fusion of recognition results according to an embodiment of the present application.
  • As shown in FIG. 5, the staged recognition results r1, r2, ..., rN can also be input into a pre-trained machine learning model, that is, the SVM module shown in FIG. 5, and the SVM module outputs the fusion result through a pre-set or pre-trained SVM kernel function.
  • After obtaining the fused gesture recognition result, the gesture recognition device can call the corresponding module according to the gesture recognition result (for example, sliding a presentation, or displaying a picture in full screen) to achieve the purpose of human-computer interaction.
  • Optionally, when it is determined that no action occurs in a video segment, the gesture recognition device may not perform gesture recognition on the video segment, so as to reduce the frequency of gesture recognition and avoid unnecessary recognition processes; specifically, the gesture recognition device may directly set the gesture recognition result of the video segment to be empty, or set the gesture recognition result of the video segment to "no action".
  • FIG. 6 is a schematic flowchart of gesture recognition according to an embodiment of the present application.
  • As shown in FIG. 6, the image capture device inputs the captured video stream into the gesture recognition device, and the gesture recognition device extracts the images in the video stream.
  • For each image (or, alternatively, for some of the images) in the video segment of the video stream in which the current image is located, the gesture recognition device extracts the optical flow information image and the color information image of the video segment according to the methods of step 304 and step 305, and obtains the gesture recognition result of the video segment.
  • After obtaining the gesture recognition results of N consecutive video segments, the gesture recognition device fuses the N gesture recognition results by the method shown in step 308 to obtain the fused gesture recognition result.
  • The machine learning models mentioned above may be obtained by machine training performed in advance on video samples labeled with the corresponding gestures.
  • The above machine training process can be implemented by a model training device.
  • Taking the case where the first machine learning model, the second machine learning model, and the two-channel deep learning model are all obtained by machine training as an example, in a possible implementation,
  • the developer may input several video stream samples into the model training device, where each video stream sample contains one gesture; the developer labels the gesture in each video stream sample in advance, divides each video stream into multiple video segments, and labels the staged gesture corresponding to each video segment.
  • The model training device extracts the optical flow information image and the color information image of each video segment according to the schemes shown in step 304 and step 305,
  • inputs the optical flow information image and the color information image of the video segment into the two-channel deep learning model, and inputs the two recognition results output by the two-channel deep learning model together with the staged gesture labeled for the video segment into the second machine learning model, so as to perform machine training on the two-channel deep learning model and the second machine learning model.
  • In addition, the model training device inputs the staged gestures of the video segments in each video stream sample together with the pre-labeled gesture of that video stream sample into the first machine learning model for machine training, so as to obtain the first machine learning model.
  • In another possible implementation in which the first machine learning model, the second machine learning model, and the two-channel deep learning model are all obtained by machine training,
  • the developer may input several video stream samples into the model training device, where each video stream sample contains one gesture, and the developer labels the gesture in each video stream sample in advance.
  • The model training device divides the video stream into multiple video segments, extracts the optical flow information image and the color information image of each video segment,
  • inputs the optical flow information image and the color information image of the video segment into the two-channel deep learning model, inputs the two recognition results output by the two-channel deep learning model into the second machine learning model, and then
  • inputs the staged gesture recognition results of the multiple video segments output by the second machine learning model into the first machine learning model;
  • the model training device also inputs the labeled gesture corresponding to the video stream into the first machine learning model, so as to simultaneously perform machine training on the first machine learning model, the second machine learning model, and the two-channel deep learning model.
  • Optionally, in addition to the two-channel deep learning model based on the optical flow information image and the color information image, the gesture recognition device can also recognize each video segment by other deep learning algorithms.
  • For example, the gesture recognition device may recognize the gesture recognition result corresponding to the video segment only through the optical flow information image, or the gesture recognition device may recognize the gesture recognition result corresponding to the video segment only through the color information image.
  • The deep learning algorithm used to recognize the gesture recognition result of a video segment is not limited in the embodiments of the present application.
  • In summary, in the method shown in the embodiment of the present application, for each video segment in the video stream, an optical flow information image and a color information image are extracted for the video segment, gesture recognition is performed separately on the optical flow information image and the color information image by deep learning algorithms,
  • the gesture recognition results corresponding to the two images are fused to determine the gesture recognition result corresponding to the video segment, and finally the gesture recognition results of the N consecutive video segments including the video segment are fused to obtain the gesture recognition result for the N consecutive video segments; that is, in the above method, a complete gesture action is divided into multiple stage actions, each stage action is recognized by a deep learning algorithm, and the recognized stage actions are finally merged into a complete gesture.
  • In this process, the gestures in the video stream do not need to be segmented and tracked; instead, a computationally fast deep learning algorithm is used to recognize the action of each stage, thereby increasing the speed of gesture recognition and reducing its delay.
  • FIG. 7 is a schematic structural diagram of a gesture recognition device 70 provided by an exemplary embodiment of the present application.
  • The gesture recognition device 70 may be implemented as the gesture recognition device 120 in the system shown in FIG. 1.
  • The gesture recognition device 70 can include a processor 71 and a memory 73.
  • The processor 71 may include one or more processing units, and a processing unit may be a central processing unit (CPU) or a network processor (NP).
  • Optionally, the gesture recognition device 70 may further include a memory 73.
  • The memory 73 can be used to store software programs that can be executed by the processor 71.
  • In addition, various types of service data or user data can be stored in the memory 73.
  • The software program may include an image acquisition module, a recognition module, and a fusion module; optionally, the software program may further include a time window determination module and a judgment module.
  • The image acquisition module is executed by the processor 71 to implement the function of acquiring the M images extracted from the first video segment of the video stream in the embodiment shown in FIG. 3 above.
  • The recognition module is executed by the processor 71 to implement the function of obtaining the gesture recognition result corresponding to the first video segment in the embodiment shown in FIG. 3 above.
  • The fusion module is executed by the processor 71 to implement the function of fusing the gesture recognition results of the N consecutive video segments in the embodiment shown in FIG. 3 above.
  • The time window determination module is executed by the processor 71 to implement the function of determining the time window in the embodiment shown in FIG. 3 above.
  • The judgment module is executed by the processor 71 to implement the function of determining whether an action occurs in the first video segment in the embodiment shown in FIG. 3 above.
  • Optionally, the gesture recognition device 70 may further include a communication interface 74, which may include a network interface.
  • The network interface is used to connect to the image capture device.
  • The network interface may include a wired network interface, such as an Ethernet interface or a fiber interface, or the network interface may include a wireless network interface, such as a wireless local area network interface or a cellular mobile network interface.
  • The gesture recognition device 70 communicates with other devices via the network interface 74.
  • The processor 71 can be connected to the memory 73 and the communication interface 74 by a bus.
  • Optionally, the gesture recognition device 70 may further include an output device 75 and an input device 77.
  • The output device 75 and the input device 77 are coupled to the processor 71.
  • The output device 75 can be a display for displaying information, a power amplifier device for playing sound, a printer, or the like;
  • the output device 75 can also include an output controller for providing output to a display screen, a power amplifier device, or a printer.
  • The input device 77 can be a device such as a mouse, keyboard, electronic stylus, or touch panel used by a user to input information; the input device 77 can also include an input controller for receiving and processing input from devices such as a mouse, keyboard, electronic stylus, or touch panel.
  • FIG. 8 is a structural block diagram of a gesture recognition apparatus according to an exemplary embodiment of the present application.
  • the gesture recognition apparatus may be implemented as part or all of a gesture recognition apparatus by a combination of hardware circuits or software and hardware.
  • the gesture recognition apparatus may be The gesture recognition device 120 in the embodiment shown in FIG. 1 above.
  • the gesture recognition apparatus may include an image acquisition unit 801, an identification unit 802, and a fusion unit 803.
  • the software program may further include a time window determination unit 804 and a determination unit 805.
  • the image obtaining unit 801 is executed by the processor to implement the function of acquiring the M images extracted in the first video segment of the video stream in the embodiment shown in FIG. 3 above.
  • the identification unit 802 is executed by the processor to implement the function of obtaining the gesture recognition result corresponding to the first video segment in the embodiment shown in FIG. 3 above.
  • the merging unit 803 is executed by the processor to implement the function of merging the gesture recognition results of the consecutive N video segments in the embodiment shown in FIG. 3 above.
  • the time window determining unit 804 is executed by the processor to implement the function of determining the time window in the embodiment shown in FIG. 3 described above.
  • the determining unit 805 is executed by the processor to implement the function of determining whether an action occurs in the first video segment in the embodiment shown in FIG. 3 above.
  • when the gesture recognition apparatus provided in the foregoing embodiment performs gesture recognition, the division into the functional units above is merely an example; in practical applications, the functions may be assigned to different functional units as needed, that is, the internal structure of the device may be divided into different functional units to complete all or part of the functions described above.
  • the gesture recognition apparatus provided in the foregoing embodiment and the embodiments of the gesture recognition method belong to the same concept; for the specific implementation process, refer to the method embodiments, and details are not described herein again.
  • a person skilled in the art may understand that all or part of the steps of the foregoing embodiments may be implemented by hardware, or by a program instructing related hardware; the program may be stored in a computer-readable storage medium.
  • the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.

Abstract

本申请提供了一种手势识别方法，涉及人机交互技术领域，所述方法包括：获取从视频流中的第一视频段中提取出的M幅图像；通过深度学习算法对该M幅图像进行手势识别，获得该第一视频段对应的手势识别结果；对包含第一视频段在内的连续N个视频段的手势识别结果进行结果融合，获得融合后的手势识别结果。在上述识别过程中，不需要对视频流中的手势进行分割和跟踪，而是通过计算速度较快的深度学习算法来识别各个阶段动作，再将各个阶段动作融合，从而达到提高手势识别的速度，降低手势识别的延迟的效果。

Description

一种手势识别方法、装置及设备 技术领域
本申请涉及人机交互技术领域,特别涉及一种手势识别方法、装置及设备。
背景技术
手势输入是实现自然、直接人机交互不可缺少的关键技术。基于计算机视觉的手势识别方法以其不依赖于设备,更自然的人机交互效果,更好的沉浸感成为当今研究的热点。
在相关技术中,基于计算机视觉的手势识别方案如下:首先通过摄像头拍摄手势图像视频流,并将视频流转化为图像帧;接着根据特定的图像跟踪算法从图像帧中分割并跟踪提取出手势的形状、特征以及位置信息,最后根据提取出的手势的形状、特征以及位置信息,结合预先建立的分类准则对手势进行识别。
在相关技术中,从图像帧中提取手势的形状、特征以及位置信息时,需要对图像中的手势进行分割和跟踪,而分割和跟踪的过程需要消耗较多的处理时间,延时过大。
发明内容
为了降低手势识别的延时,本申请的实施例提供了一种手势识别方法、装置及设备。
第一方面,提供了一种手势识别方法,所述方法包括:获取M幅图像,所述M幅图像是从视频流中的第一视频段中提取出的,其中,所述第一视频段是所述视频流中任意一个视频段,M为大于或等于2的整数;通过深度学习算法对所述M幅图像进行手势识别,获得所述第一视频段对应的手势识别结果;在获得所述视频流中包含所述第一视频段在内的连续N个视频段的手势识别结果后,对所述连续N个视频段的手势识别结果进行结果融合,获得融合后的手势识别结果;N≥2,且N为整数。
在上述手势识别方法中，针对视频流中的每一个视频段，获取该视频段中的M幅图像，并通过深度学习算法对该M幅图像进行手势识别，以获得该视频段对应的手势识别结果，最后将包含该视频段在内的连续N个视频段的手势识别结果进行融合，获得对该连续N个视频段的手势识别结果，即在上述识别过程中，不需要对视频流中的手势进行分割和跟踪，而是通过计算速度较快的深度学习算法来识别各个阶段动作，并将各个阶段动作进行融合，从而达到提高手势识别的速度，降低手势识别的延迟的效果。
在一种可能的实现方案中,所述对所述连续N个视频段的手势识别结果进行结果融合,获得融合后的手势识别结果,包括:
将所述连续N个视频段的手势识别结果输入预先训练的第一机器学习模型,获得所述融合后的手势识别结果,所述第一机器学习模型用于确定输入的连续N个手势识别结果所构成的整体手势运动趋势,并将所述整体手势运动趋势对应的手势输出为所述融合后的手势识别结果。
在实际应用中，用户在执行某个手势操作时，可能在一个手势操作的过程中，短时间内做出不符合当前手势操作的手势动作，而通过上述可能的实现方案，在识别出各个视频段的手势识别结果后，可以根据连续多个视频段的手势识别结果所指示的手势运动趋势获得最终的手势识别结果，消除用户在短时间内的错误手势对最终获得的手势识别结果的影响，从而提高手势识别的准确度。
在一种可能的实现方案中,所述第一机器学习模型为神经网络模型,且所述神经网络模型的神经元数量为N;或者,所述第一机器学习模型为支持向量机SVM模型。
在一种可能的实现方案中,所述对所述连续N个视频段的手势识别结果进行结果融合,获得融合后的手势识别结果,包括:
获取预先设置的,所述连续N个视频段的手势识别结果各自对应的权重系数;
根据所述连续N个视频段的手势识别结果各自对应的权重系数,对所述连续N个视频段的手势识别结果进行加权平均,获得所述融合后的手势识别结果。
通过上述可能的实现方案,在识别出各个视频段的手势识别结果后,可以根据预先设置的权重对连续多个视频段的手势识别结果进行加权平均,以减弱用户在短时间内的错误手势对最终获得的手势识别结果的影响,从而提高手势识别的准确度。
在一种可能的实现方案中,所述通过深度学习算法对所述M幅图像进行手势识别,获得所述第一视频段对应的手势识别结果,包括:
对所述M幅图像进行图像处理,获得所述第一视频段对应的光流信息图像,所述光流信息图像包含所述M幅图像中的第一图像与所述第一图像之前的第p幅图像之间的光流信息,所述第一图像是所述M幅图像中的任意一幅,所述光流信息包含图像中的像素点的瞬时速度矢量信息,并通过第一深度学习算法对所述光流信息图像进行手势识别,获得第一识别结果,p为大于或等于1的整数;对所述M幅图像进行图像处理,获得所述第一视频段对应的彩色信息图像,所述彩色信息图像包含所述M幅图像的彩色信息,所述彩色信息包含图像中的各个像素点的色值,并通过第二深度学习算法对所述彩色信息图像进行手势识别,获得第二识别结果;对所述第一识别结果和所述第二识别结果进行融合,获得所述第一视频段的手势识别结果。
上述可能的实现方案根据M幅图像提取视频段的光流信息和彩色信息,并根据提取的光流信息和彩色信息分别进行手势识别,再将分别识别出的手势识别结果进行融合,改善了通过单一的深度学习算法识别出的手势不准确的问题,以提高对视频段的手势识别结果的准确性。
在一种可能的实现方案中,所述对所述M幅图像进行图像处理,获得所述第一视频段对应的光流信息图像,包括:
对于所述第一图像,按预设规则获取所述视频流中处于所述第一图像之前的第p幅图像;计算所述第一图像与所述第p幅图像之间的光流信息,并生成包含所述第一图像与所述第p幅图像之间的光流信息的光流信息图像;其中,所述第一图像与所述第p幅图像之间的时间间隔不小于所述第一深度学习算法的前向计算时间以及计算所述光流信息图像所需的时间;
或者,
对于所述第一图像，按预设规则获取所述视频流中处于所述第一图像之前的全部p幅图像；计算所述第一图像以及所述M幅图像中每相邻两幅图像之间的光流信息，将所述每相邻两幅图像之间的光流信息进行累加后，生成包含累加后的光流信息的光流信息图像；其中，所述第一图像与所述第一图像之前的第p幅图像之间的时间间隔不小于所述第一深度学习算法的前向计算时间以及计算所述光流信息图像所需的时间。
在上述可能的实现方案中,根据当前获取到的图像,以及当前图像之前的p幅图像,即可以获得当前图像与当前图像之前的第p幅图像之间的光流信息图像,以便后续通过深度学习算法对光流信息图像进行手势识别,不需要对图像中的手势进行分割和跟踪,从而简化了手势识别结果的处理过程,提高手势识别的速度,降低了手势识别的延迟。
在上述可能的实现方案中,所述对所述M幅图像进行图像处理,获得所述第一视频段对应的彩色信息图像,包括:
提取所述M幅图像中的m幅图像的彩色信息,根据提取到的彩色信息生成所述m幅图像各自对应的彩色信息图像,将所述m幅图像各自对应的彩色信息图像获取为所述第一视频段对应的彩色信息图像;所述m幅图像是所述M幅图像中随机的m幅图像,或者,所述m幅图像是所述M幅图像中,相对于各自在视频流中的前一幅图像变化最大的m幅图像,m为大于或等于1的整数;
或者,检测所述M幅图像中图像内容随时间变化的像素位置,计算所述M幅图像中对应识别出的像素位置处的彩色信息的平均值,获得所述识别出的像素位置处的新的彩色信息,根据所述识别出的像素位置处的新的彩色信息生成所述第一视频段对应的彩色信息图像。
在一种可能的实现方案中,所述获取M幅图像之前,所述方法还包括:
确定所述视频流中的一个预设时间长度的时间窗,所述时间窗的结束时刻处于所述第一视频段对应的时间段内;根据所述时间窗内的最后一幅图像以及至少一幅参考图像,判断所述第一视频段中是否有动作发生,所述至少一幅参考图像是所述时间窗内除了所述最后一幅图像之外的其它任意一幅图像;若判断结果为所述第一视频段中有动作发生,则执行所述获取M幅图像的步骤。
由于手势操作必然会涉及到手势动作，因此，通过上述可能的实现方案，在对视频段进行手势识别之前，首先通过视频段内的图像与该图像之前的至少一幅图像来判断该视频段内是否有动作发生，并在判断出有动作发生时，才执行后续的识别操作，从而减少了不必要的识别步骤，节约计算资源，同时提高识别效率。
在一种可能的实现方案中,所述根据所述时间窗内的最后一幅图像以及所述至少一幅参考图像,判断所述第一视频段中是否有动作发生,包括:
针对所述至少一幅参考图像中的每一幅参考图像,计算所述最后一幅图像的偏导图像,所述偏导图像中的每个像素的值,是所述最后一幅图像中对应像素的值相对于所述参考图像中对应像素的值的偏导;对所述偏导图像中的各个像素的值进行归一化处理,获得归一化之后的偏导图像;根据预设的二值化阈值,对所述归一化之后的偏导图像进行二值化处理,获得所述偏导图像的二值化图像,所述二值化图像中的各个像素的值为0或1;计算所述二值化图像中各个像素的灰度值之和;当所述灰度值之和大于0时,确定所述第一视频段中有动作发生。
在一种可能的实现方案中,所述对所述第一识别结果和所述第二识别结果进行融合,获得所述第一视频段的手势识别结果,包括:
对所述第一识别结果和所述第二识别结果进行平均值计算,根据所述平均值计算的计算结果获得所述第一视频段的手势识别结果;或者,将所述第一识别结果和所述第二识别结果输入预先训练的第二机器学习模型,以获得所述第一视频段的手势识别结果。
第二方面,提供了一种手势识别装置,该装置具有实现上述第一方面及第一方面的可能的实现方案所提供的手势识别方法的功能。所述功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。所述硬件或软件包括一个或多于一个与上述功能相对应的单元。
第三方面,提供了一种手势识别设备,该设备包括:处理器和存储器;该设备中的处理器,通过执行存储器中存储的程序或指令以实现上述第一方面及第一方面的可能的实现方案所提供的手势识别方法。
第四方面,提供了一种计算机可读存储介质,该计算机可读存储介质存储有可执行程序,该可执行程序由处理器执行以实现上述第一方面及第一方面的可能的实现方案所提供的手势识别方法。
附图说明
图1是本申请涉及的一种手势识别系统的架构图;
图2是图1所示实施例涉及的一种手势识别示意图;
图3是本申请一个示例性实施例提供的手势识别方法的方法流程图;
图4是图3所示实施例涉及的两种时间窗跨度示意图;
图5是图3所示实施例涉及的一种通过识别结果融合示意图;
图6是图3所示实施例涉及的一种手势识别的流程示意图;
图7是本申请一个示例性实施例提供的一种手势识别设备的结构示意图;
图8是本申请一个示例性实施例提供的一种手势识别装置的结构方框图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。
图1是本申请实施例涉及的一种手势识别系统的系统架构图。该手势识别系统可以包括以下设备:图像采集设备110以及手势识别设备120。
图像采集设备110可以是摄像头。比如,该图像采集设备110可以是单个摄像头,或者,该图像采集设备110也可以是由两个或两个以上的摄像头组成的摄像模组。
图像采集设备110可以固定设置,或者,图像采集设备110也可以集成有伺服电机,该伺服电机可以在手势识别设备的控制下,带动图像采集设备110转动或移动,以改变图像采集设备110的拍摄角度或拍摄位置。
手势识别设备120可以是通用计算机,或者,手势识别设备也可以是嵌入式计算设备。
其中，图像采集设备110和手势识别设备120可以是相互独立的设备，且图像采集设备110和手势识别设备120之间通过有线或者无线网络相连。
或者，图像采集设备110和手势识别设备120也可以集成在同一个实体设备中，且图像采集设备110和手势识别设备120之间通过通信总线相连。
可选的,手势识别设备120在识别出视频流中的手势之后,将识别出的手势传输给控制设备130,由控制设备130根据识别出的手势确定相应的控制指令,根据确定出的控制指令执行相应的控制操作,比如,根据控制指令控制图形显示,或者,根据控制指令控制被控设备执行某项操作等等。
在本申请实施例中,图像采集设备110将采集到的视频流传输给手势识别设备120,由手势识别设备120对视频流进行图像分析和手势识别,以即时识别视频流中的手势。请参考图2,其示出了本申请实施例涉及的一种手势识别示意图。如图2所示,在进行手势识别时,手势识别设备120可以从视频流的一个视频段中提取M幅图像(其中,M为大于或等于2的整数),手势识别设备120,通过深度学习算法对该M幅图像进行手势识别,获得该视频段对应的手势识别结果,在获得该视频流中包含该视频段在内的连续N个视频段的手势识别结果后,对该连续N个视频段的手势识别结果进行结果融合,获得融合后的手势识别结果。
即在本申请实施例所示的方案中,将一个完整的手势动作划分为多个阶段动作,通过深度学习算法来识别每一个阶段动作,最后将识别出的各个阶段动作融合为完整的手势动作,在识别过程中,不需要对视频流中的手势进行分割和跟踪,而是通过计算速度较快的深度学习算法来识别各个阶段动作,从而达到提高手势识别的速度,降低手势识别的延迟的效果。
以上述深度学习算法是基于光流信息和彩色信息的双通道深度学习算法为例,请参考图3,其是本申请一个示例性实施例提供的手势识别方法的方法流程图。如图3所示,该手势识别方法可以包括如下步骤:
步骤301,对于视频流中的第一视频段,确定视频流中的一个预设时间长度的时间窗,该时间窗的结束时刻处于该第一视频段对应的时间段内。
第一视频段是该视频流中任意一个视频段,在本申请实施例中,手势识别设备可以将视频流划分为首尾相连的若干个视频段,并针对每个视频段分别进行手势识别。
视频流由一系列对应不同时间点的视频图像组成,在本申请实施例中,上述时间窗可以是两个不同的视频图像对应的时间点之间的时间窗,即视频流在该时间窗内的第一幅图像对应的时间点与该时间窗内的最后一幅图像对应的时间点之间的时间长度为上述预设时间长度。
上述时间窗内的最后一幅图像是待识别的第一视频段中的一幅图像,而视频流在该时间窗内的其它图像可以是该第一视频段内的图像,也可以是该视频流中处于该第一视频段之前的图像。
比如,请参考图4,其示出了本申请实施例涉及的两种时间窗跨度示意图,在图4中,时间窗的起始时刻为t1,结束时刻为t2,而第一视频段的起始时刻为t3,结束时刻为t4
如图4(a)所示,在一种可能的实现方式中,t1和t2处于t3和t4之间,即上述时间窗完全处于上述第一视频段之内。
如图4(b)所示,在另一种可能的实现方式中,t2处于t3和t4之间,而t1处于t3之前,即上述时间窗部分处于上述第一视频段之内,另一部分处于上述第一视频段之前。
此外，上述预设时间长度可以由开发人员预先设置在手势识别设备中。
步骤302,根据该时间窗内的最后一幅图像以及至少一幅参考图像,判断该第一视频段中是否有动作发生;若是,进入步骤303,否则,返回步骤301,确定下一个预设时间长度的时间窗。
该至少一幅参考图像是该时间窗内除了最后一幅图像之外的其它任意一幅图像。
在本申请实施例中,手势识别设备根据视频流在该时间窗内的最后一幅图像,以及视频流在该时间窗内的其它至少一幅图像之间的差异,来判断该第一视频段中是否有动作发生。
在实际应用中,上述根据该时间窗内的最后一幅图像,以及该时间窗内其它至少一幅图像,判断该第一视频段中是否有动作发生的步骤可以分为如下几个子步骤:
步骤302a,针对该至少一幅参考图像中的每一幅参考图像,计算该最后一幅图像的偏导图像,该偏导图像中的每个像素的值,是该最后一幅图像中对应像素的值相对于该参考图像中对应像素的值的偏导。
在本申请实施例中，可以定义输入的视频流中的图像为f(x,y,t)，其中x是图像的水平分量，y是图像的竖直分量；t代表时间t=1,2,…,t0,…。输入的视频流的两帧图像f(x,y,t0),f(x,y,t0-q)，对于相邻两帧图像，q=1。
定义:在时间t0的一帧图像为f(x,y,t0),其中,t0时刻的图像为上述时间窗中的最后一幅图像,则其前q时刻的图像为f(x,y,t0-q),手势识别设备计算视频流关于时间t在t0时刻相对于t0-q时刻的偏导:
g(x,y,t0) = ∂f(x,y,t)/∂t ≈ [f(x,y,t0)-f(x,y,t0-q)]/q  ⑴
步骤302b,对该偏导图像中的各个像素的值进行归一化处理,获得归一化之后的偏导图像。
手势识别设备可以将g(x,y,t0)归一化至范围[a,b],例如,归一化范围选择[a,b]=[0,1]。即手势识别设备对g(x,y,t0)中的每个像素的值分别归一化至[0,1]区间内的某个值。
步骤302c,根据预设的二值化阈值,对该归一化之后的偏导图像进行二值化处理,获得该偏导图像的二值化图像,该二值化图像中的各个像素的值为0或1。
在本申请实施例中,在获得归一化之后的偏导图像后,可以根据归一化之后的偏导图像中每个像素的值与预设的二值化阈值之间的大小关系,对归一化之后的偏导图像进行二值化处理,将归一化之后的偏导图像中每个像素的值二值化为0或者1,其二值化的公式如下:
gb(x,y,t0) = 1（当归一化后的g(x,y,t0)＞Z时）；gb(x,y,t0) = 0（当归一化后的g(x,y,t0)≤Z时）  ⑵
在上述公式⑵中，Z为预设的二值化阈值，对于归一化之后的偏导图像g(x,y,t0)中的像素的值，当该像素的值大于Z时，将该像素的值二值化为1，当该像素的值小于或者等于Z时，将该像素的值二值化为0。
其中,上述预设的二值化阈值为预先设置的,处于(0,1)之间的某一个数值,比如,该预设的二值化阈值可以为0.5,或者,该预设的二值化阈值也可以为0.4或者0.6等等。 该二值化阈值可以由开发人员根据实际处理效果预先设定。
步骤302d,计算该二值化图像中各个像素的灰度值之和。
步骤302e,当该灰度值之和大于0时,确定该第一视频段中有动作发生。
在本申请实施例中,手势识别设备在获得二值化图像gb(x,y,t0)之后,计算gb(x,y,t0)灰度值总和Sum(t0),当总和Sum(t0)大于0,即可以确定该第一视频段中有动作发生。否则认为该第一视频段中“无动作”。其公式如下:
Sum(t0)=∑(x,y)gb(x,y,t0)  ⑶
若Sum(t0)>0，则判断有动作发生，进入步骤303；若Sum(t0)≤0，则判断没有动作发生，返回步骤301。
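The motion test of steps 302a to 302e can be sketched as follows; this is only an illustrative sketch assuming Python with NumPy and grayscale input images, and the function name and default threshold value are assumptions rather than part of the embodiment:

    import numpy as np

    def has_motion(last_img, ref_img, q=1, z_threshold=0.5):
        # f(x, y, t0): last grayscale image of the time window; f(x, y, t0 - q): a reference image.
        f_t0 = last_img.astype(np.float64)
        f_tq = ref_img.astype(np.float64)

        # Partial derivative of the stream with respect to time (formula (1)).
        g = (f_t0 - f_tq) / q

        # Normalize the derivative image to [a, b] = [0, 1].
        g_norm = (g - g.min()) / (g.max() - g.min() + 1e-12)

        # Binarize with the preset threshold Z (formula (2)).
        g_bin = (g_norm > z_threshold).astype(np.uint8)

        # Sum of the binary image (formula (3)); an action occurred iff the sum is greater than 0.
        return int(g_bin.sum()) > 0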
步骤303,获取M幅图像,该M幅图像是从该第一视频段中提取出的M幅图像。
当上述步骤302中判断出第一视频段内有动作发生时,手势识别设备可以从该第一视频段中提取出M幅图像,M为大于或等于2的整数。
在实际应用中,手势识别设备可以提取出该第一视频段中的每一幅图像,获得该M幅图像。或者,手势识别设备也可以在第一视频段中每隔一幅或多幅图像提取出一幅图像,以获得M幅图像。
步骤304,对该M幅图像进行图像处理,获得该第一视频段对应的光流信息图像。
上述光流信息图像包含M幅图像中的第一图像与该第一图像之前的第p幅图像之间的光流信息,该第一图像是该M幅图像中的任意一幅,该光流信息包含图像中的像素点的瞬时速度矢量信息,p为大于或等于1的整数。
其中，光流是空间运动物体在观察成像平面上的像素运动的瞬时速度，手势识别设备可以利用图像序列中像素在时间域上的变化以及相邻帧之间的相关性来找到之前一幅图像与当前图像之间存在的对应关系，从而计算出前后两幅图像之间物体的运动信息，该计算出的前后两幅图像之间物体的运动信息就是这两幅图像之间的光流信息。上述计算前后两幅图像之间物体的运动信息的方法称为光流法。其中，光流信息也称为光流场（optical flow field），是指图像灰度模式的表观运动，其是一个二维矢量场，它包含的信息即是各像点的瞬时运动速度矢量信息，因此，光流信息可以表现为一个与原图像大小相同的双通道图像。
在提取光流信息图像时,手势识别设备可以利用第一视频段内的RGB图像序列,获得一个光流信息图像(不论第一视频段内包含多少个帧)。在本申请实施例中,获得该第一视频段对应的光流信息图像的方式可以由如下两种:
一、对于该M幅图像中的第一图像,按预设规则获取该视频流中处于该第一图像之前的第p幅图像;计算第一图像与该第p幅图像之间的光流信息,并生成包含所述第一图像与所述第p幅图像之间的光流信息的光流信息图像。
其中,该第一图像与该第p幅图像之间的时间间隔不小于第一深度学习算法的前向计算时间以及计算光流信息图像所需的时间。其中,该第一深度学习算法是手势识别设备后续根据光流信息图像识别手势所使用的算法。
其中，上述预设规则可以是开发人员或者用户自行设置的规则，比如，开发人员或者用户可以人工设置上述p的数值。或者，手势识别设备也可以按照预设规则，根据设备的处理性能自行设置上述p的数值；比如，手势识别设备可以预先运行一次第一深度学习算法的前向计算以及光流信息图像的计算，并记录该前向计算时间和计算光流信息图像的时间，并根据前向计算时间、计算光流信息图像的时间以及视频流的帧率（即每秒钟的视频中包含多少幅图像）来设置p的数值，具体比如，手势识别设备可以确定上述前向计算时间和计算光流信息图像的时间中的较大值对应在视频流中的图像数量，并将确定出的图像数量对应的数值设置为p的数值。
对于实时视频,假设T是某一幅图像与该幅图像之前的第p幅图像之间的时间间隔,则T的最小值可以是手势识别设备通过光流信息图像进行手势识别的深度学习网络前向计算所需要的时间和手势识别设备计算光流信息图像所需的时间这两者中的较大值。
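As a small worked example of this sizing rule (the helper name and the timing values below are assumptions, not measurements from the embodiment): with a 40 ms forward pass, a 25 ms optical-flow computation and a 30 fps stream, p works out to 2 frames.

    import math

    def choose_p(forward_time_s, flow_time_s, fps):
        # T must be at least the larger of the forward-pass time and the flow-computation time.
        t_min = max(forward_time_s, flow_time_s)
        # Convert that minimum interval into a number of frames of the incoming stream.
        return max(1, math.ceil(t_min * fps))

    p = choose_p(0.040, 0.025, 30)   # max(0.040, 0.025) * 30 = 1.2 -> p = 2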
假设OF[·]代表光流算法,在一种可能的实现方式中,手势识别设备可以通过欧拉光流场(Eulerian motion field)算法,根据p幅图像中的一幅图像It(x,y),以及It(x,y)之前的第p幅图像It-T(x,y),直接计算获得该M幅图像中的一幅图像对应的光流信息,并生成包含计算出的光流信息的光流信息图像,并将该光流信息图像作为该第一视频段对应的光流信息图像。其计算公式可以简单表示如下:
Ut(x,y)=OF[It-T(x,y),It(x,y)]    ⑷
其中,在上述公式(4)中,Ut(x,y)为图像It(x,y)对应的光流信息图像。OF[·]代表上述欧拉光流场算法。
二、对于该M幅图像中的第一图像,按预设规则获取该视频流中处于第一图像之前的全部p幅图像;计算第一图像以及该p幅图像中每相邻两幅图像之间的光流信息,将每相邻两幅图像之间的光流信息进行累加后,生成包含累加后的光流信息的光流信息图像。
在另一种可能的实现方式中,手势识别设备可以通过拉格朗日光流场(Lagrangian motion field)算法,计算M幅图像中的一幅图像It(x,y),以及It(x,y)之前的p幅图像It-1(x,y),It-2(x,y),……,It-T(x,y)中的每两个临近图像之间的光流信息,然后累加每两个临近图像之间的光流信息,生成包含累加后的光流信息的图像Ut(x,y)。其中,累加光流信息的过程中涉及到缺失数据插补,可以选线性(linear),双线性(bilinear),三次曲线(cubic)等插补方式。
其计算公式可以简单表示如下:
Ut(x,y) = OF[It-T(x,y),It-T+1(x,y)] + OF[It-T+1(x,y),It-T+2(x,y)] + …… + OF[It-1(x,y),It(x,y)]  ⑸
其中，在上述公式(5)中，Ut(x,y)为图像It(x,y)对应的光流信息图像。OF[·]代表上述拉格朗日光流场算法。
对于方式一,只需要计算一次光流场,因此需要选择较准确的光流场算法。对于方式二,需要计算多次光流场,允许使用准确性低但速度快的光流场算法。
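Both ways of building the optical-flow information image can be sketched as follows. The sketch assumes Python with OpenCV, uses Farneback dense flow as one possible optical-flow algorithm OF[·] (the embodiment does not prescribe a specific one), and the plain summation in method two omits the interpolation of missing data mentioned above:

    import cv2
    import numpy as np

    def _dense_flow(gray_prev, gray_curr):
        # Farneback dense optical flow; returns an H x W x 2 two-channel flow image.
        return cv2.calcOpticalFlowFarneback(gray_prev, gray_curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)

    def flow_image_eulerian(img_prev, img_curr):
        # Method 1: one flow field directly between I_{t-T}(x, y) and I_t(x, y), as in formula (4).
        g0 = cv2.cvtColor(img_prev, cv2.COLOR_BGR2GRAY)
        g1 = cv2.cvtColor(img_curr, cv2.COLOR_BGR2GRAY)
        return _dense_flow(g0, g1)

    def flow_image_lagrangian(images):
        # Method 2: accumulate the flow between every two adjacent images, as in formula (5).
        grays = [cv2.cvtColor(im, cv2.COLOR_BGR2GRAY) for im in images]
        acc = np.zeros((*grays[0].shape, 2), dtype=np.float32)
        for g0, g1 in zip(grays[:-1], grays[1:]):
            acc += _dense_flow(g0, g1)
        return acc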
步骤305,对该M幅图像进行图像处理,获得该第一视频段对应的彩色信息图像。
其中,上述彩色信息图像包含该M幅图像的彩色信息。其中,彩色信息包含图像中的各个像素点的彩色色值。
手势识别设备处理第一视频段内的图像序列,输出m幅彩色信息图像,比如RGB(red green blue,红绿蓝)图像,来代表该第一视频段对应的彩色信息图像,m为大于或等于1的整数。假设输入第一视频段内的所有图像It-T(x,y),It-T+1(x,y),…,It-1(x,y),It(x,y),而输出的则是在该视频段的结束时刻,用m幅图像
Ĩt,1(x,y)，Ĩt,2(x,y)，……，Ĩt,m(x,y)
来代表该第一视频段内的图像的彩色信息。
其中,在对该M幅图像进行图像处理,获得该第一视频段对应的彩色信息图像时,手势识别设备可以通过以下方法获得彩色信息图像:
1)提取该M幅图像中的m幅图像的彩色信息,根据提取到的彩色信息生成该m幅图像各自对应的彩色信息图像,将该m幅图像各自对应的彩色信息图像获取为该第一视频段对应的彩色信息图像。比如,手势识别设备提取到m幅图像中的任意一幅图像的彩色信息后,即生成对应于该任意一幅图像的彩色信息图像,且生成的该彩色信息图像包含该任意一幅图像的彩色信息。
其中,该m幅图像是该M幅图像中随机的m幅图像。比如,以获取单幅彩色信息图像为例,在本申请实施例中,当第一视频段的时间长度比较小时,可以直接从第一视频段内随机选择一幅图像对应的彩色信息图像来表示
Ĩt(x,y)
即:
Ĩt(x,y) = Ik(x,y)，k为[t-T, t]内随机选取的时间点  ⑹
其中,t-T为第一视频段内的第一幅图像对应的时间点,t为第一视频段内的最后一幅图像对应的时间点。
可选的,除了随机选择m幅图像的彩色信息图像作为第一视频段对应的彩色信息图像之外,手势识别设备还可以通过其它策略选择出m幅图像的彩色信息图像作为第一视频段对应的彩色信息图像。比如,手势识别设备可以将上述M幅图像中,对应时间处于最前或最后的m幅图像的彩色信息图像作为第一视频段对应的彩色信息图像。
在另一种可能的实现方式中,该m幅图像可以是该M幅图像中,相对于各自在视频流中的前一幅图像变化最大的m幅图像。
比如,针对M幅图像中的每一幅图像,手势识别设备可以检测该图像中,与视频流中处于该图像之前的一幅图像相比发生变化的像素;手势识别设备可以将该M幅图像中,相对于各自的前一幅图像发生变化的像素数量最多的m幅图像对应的彩色信息图像获取为该第一视频段对应的彩色信息图像。
2)检测该M幅图像中图像内容随时间变化的像素位置,计算该M幅图像中对应识别出的像素位置处的彩色信息的平均值,获得该识别出的像素位置处的新的彩色信息,根据识别出的像素位置处的新的彩色信息生成该第一视频段对应的彩色信息图像。
在本申请实施例中,手势识别设备还可以将该M幅图像中相同位置的像素点进行比对,以识别出该M幅图像中图像内容随时间变化而改变的像素位置(像素位置可以是像素点在图像中的坐标),并对该M幅图像中对应识别出的像素位置处的像素点的彩色信息取平均值,获得对应识别出的像素位置处的新的彩色信息,并生成新的彩色信息图像,其中,新的彩色信息图像中,对应上述识别出的像素位置处的彩色信息为上述取平均值获得的新的彩色信息。
其中,上述检测图像中与前一幅图像相比发生变化的像素的算法,以及检测图像中随时间变化而改变的像素位置的算法,可以统称为时空显著性图像检测算法。
3)提取该M幅图像中的全部或部分图像的彩色信息,获得该全部或部分图像各自对应的彩色信息图像,计算该全部或部分图像各自对应的彩色信息图像中,各个像素处的彩色信息的平均值,获得该第一视频段对应的彩色信息图像。
比如,以上述M幅图像为是频段内的全部图像,且获取单幅彩色信息图像为例,将第一视频段内的各个图像的彩色信息的平均值作为第一视频段的彩色信息图像的计算公式可以如下:
Ĩt(x,y) = [It-T(x,y) + It-T+1(x,y) + …… + It(x,y)] / n  ⑺
其中,t-T为第一视频段内的第一幅图像对应的时间点,t为第一视频段内的最后一幅图像对应的时间点;n为第一视频段内的图像的数量。
4)提取该M幅图像中的全部或部分图像的彩色信息,生成该全部或部分图像各自对应的彩色信息图像,计算该全部或部分图像各自对应的彩色信息图像中,各个像素处的彩色信息的平均值,再将该全部或部分图像各自对应的彩色信息图像中的各个像素的彩色信息减去上述计算出的各个像素处的彩色信息后,将获得的彩色信息图像作为该第一视频段对应的彩色信息图像。
由于在上述M幅图像中，发生变化的像素通常是图像中的前景部分（即对应人手的部分），而背景部分对应的像素通常是不变的，因此，在上述全部或部分图像各自对应的彩色信息图像中，对应背景部分的像素的彩色信息与该处的彩色信息的平均值通常是相同或相近的，而对应前景部分的像素的彩色信息与该处的彩色信息的平均值通常差别较大，因此，在本申请实施例中，还可以将全部或部分图像各自对应的彩色信息图像中的各个像素的彩色信息减去对应像素位置的平均值，可以获得上述全部或部分图像各自对应的，去除背景部分后的彩色信息图像，手势识别设备可以将全部或部分图像各自对应的，去除背景部分后的彩色信息图像作为第一视频段对应的彩色信息图像。
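A few of the color-information strategies above can be sketched as follows; the sketch assumes Python with NumPy, `frames` being the color images of the first video segment, and all function names being illustrative:

    import numpy as np

    def color_images_random(frames, m=1, rng=None):
        # Strategy 1: represent the segment by m randomly chosen color frames (formula (6) for m = 1).
        rng = rng or np.random.default_rng()
        idx = sorted(rng.choice(len(frames), size=m, replace=False))
        return [frames[i] for i in idx]

    def color_image_mean(frames):
        # Strategy 3: per-pixel average of the segment's frames (formula (7)).
        return np.mean(np.stack(frames).astype(np.float32), axis=0)

    def color_images_background_removed(frames):
        # Strategy 4: subtract the per-pixel mean so that the static background cancels
        # and mainly the moving hand region remains.
        mean = color_image_mean(frames)
        return [f.astype(np.float32) - mean for f in frames]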
步骤306,通过第一深度学习算法对该光流信息图像进行手势识别,获得第一识别结果,并通过第二深度学习算法对该彩色信息图像进行手势识别,获得第二识别结果。
在本申请实施例中,可以根据输入的视频流,在前序步骤获得彩色信息图像(例如RGB图像)和光流信息图像,在此步骤306分别用两个深度学习模型进行手势识别,并将两个深度学习模型识别的结果在下一个步骤进行融合。
本申请实施例使用了双通道深度学习模型来做手势识别,其中一个通道是Temporal stream(对应上述第一深度学习算法),其输入的是光流信息图像,最后输出对当前光流信息图像的手势识别结果;比如,在上述步骤304中,对于M幅图像中的每一幅图像,手势识别设备获取到该幅图像的光流信息图像之后,即缓存该光流信息图像,在该光流信息图像进行手势识别时,手势识别设备将最近存储的X个光流信息图像输入深度学习通道Temporal stream,以输出该X个光流信息图像对应的手势识别结果,并将输出的该手势识别结果作为对该第一视频段的光流信息图像进行手势识别的结果。
上述双通道深度学习模型中,另外一个通道为Spatial stream(对应上述第二深度学习算法),其输入的是步骤305中获得的,表示第一视频段中的至少一幅彩色信息图像,输出是对该至少一幅彩色信息图像的手势识别结果。
其中,上述双通道深度学习模型是预先训练好的机器学习模型。
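The two-stream inference can be outlined as below; `temporal_net` and `spatial_net` stand for the two pre-trained deep learning models (Temporal stream and Spatial stream) and, like the buffer length X, are assumptions of this sketch:

    import numpy as np

    def recognize_segment(flow_images, color_images, temporal_net, spatial_net, x=5):
        # Temporal stream: feed the X most recently buffered optical-flow images.
        first_result = temporal_net(np.stack(flow_images[-x:]))
        # Spatial stream: feed the segment's color information image(s).
        second_result = spatial_net(np.stack(color_images))
        return first_result, second_result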
步骤307,对该第一识别结果和该第二识别结果进行融合,获得该第一视频段的手势识别结果。
本申请实施例中,由于上述步骤306中获得光流信息图像的手势识别结果和彩色信息图像包含的手势识别结果,是对同一段视频段的手势识别结果,因此,手势识别设备在获取到光流信息图像的手势识别结果和彩色信息图像包含的手势识别结果后,可以对这两个结果进行融合,以获得第一视频段的手势识别结果。
其中,对第一识别结果和第二识别结果进行融合的方式可以有两种:
一种是对该第一识别结果和该第二识别结果进行平均值计算，根据该平均值计算的计算结果获得该第一视频段的手势识别结果。
另一种是将该第一识别结果和该第二识别结果输入预先训练的第二机器学习模型,比如线性支持向量机(Support Vector Machine,SVM)模型,以获得该第一视频段的手势识别结果。其中,上述第二机器学习模型是用于根据输入的两个识别结果确定出单个识别结果的学习模型,该第二机器学习模型可以通过预先标注好手势的视频段进行训练获得。具体比如,上述两个识别结果可以是两个数值,手势识别设备可以将两个数值输入第二机器学习模型,第二机器学习模型按照预先训练好的计算公式以及输入的两个数值计算出一个融合后的数值,并将融合后的数值输出为第一视频段的手势识别结果。
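Both fusion options for the two per-stream results can be sketched as follows; the optional `svm_model` stands for the pre-trained second machine learning model (for example a linear SVM) and is an assumption of the sketch:

    import numpy as np

    def fuse_streams(first_result, second_result, svm_model=None):
        # Option 1: average the two score vectors (or scalar values).
        if svm_model is None:
            return (np.asarray(first_result, dtype=np.float32)
                    + np.asarray(second_result, dtype=np.float32)) / 2.0
        # Option 2: let the pre-trained second model decide from both results.
        features = np.concatenate([np.ravel(first_result), np.ravel(second_result)])
        return svm_model.predict(features.reshape(1, -1))[0]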
对于每个视频段,手势识别设备实时获得该视频段对应的一个阶段性的手势识别结果,并存入临时手势识别结果库。
步骤308,在获得该视频流中包含上述第一视频段在内的连续N个视频段的手势识别结果后,对该连续N个视频段的手势识别结果进行结果融合,获得融合后的手势识别结果。
其中,N≥2,且N为整数。
在本申请实施例中,在对连续N个视频段的手势识别结果进行结果融合时,手势识别设备可以将该连续N个视频段的手势识别结果输入预先训练的第一机器学习模型,获得该融合后的手势识别结果,该第一机器学习模型用于确定输入的连续N个手势识别结果所构成的整体手势运动趋势,并将该整体手势运动趋势对应的手势输出为融合后的手势识别结果。具体比如,连续N个手势识别结果可以是N个数值,手势识别设备可以将N个数值按照N个视频段的时间顺序输入第一机器学习模型,第一机器学习模型按照预先训练好的计算公式以及先后输入的N个数值计算出一个融合后的数值,并将融合后的数值输出为融合后的手势识别结果。
其中,该第一机器学习模型为神经网络模型,且该神经网络模型的神经元数量为N;或者,该第一机器学习模型为支持向量机SVM模型。
或者,在本申请实施例中,对该连续N个视频段的手势识别结果进行结果融合时,手势识别设备可以获取预先设置的,该连续N个视频段的手势识别结果各自对应的权重系数;根据该连续N个视频段的手势识别结果各自对应的权重系数,对该连续N个视频段的手势识别结果进行加权平均,获得该融合后的手势识别结果。
在实际应用中,用户在执行单个手势操作的过程中,总体上的手势运动趋势符合用户想要做出的手势动作,但是可能会有一小段时间内,其手势不符合用户想要做出的手势动作。比如,以用户想要做出的手势操作为向上抬手为例,用户在1s内做出向上抬手的手势操作,但是在这1s内的某一个很短的时间段(比如0.2s)内,用户并没有向上抬手,而是微微向下压手,而在这很短的时间段之后,用户又继续向上抬手,此时,手势识别设备识别出的上述很短的时间段内的手势识别结果,并不符合用户当前想要执行的手势操作。因此,在本申请实施例中,为了提高手势识别的准确性,手势识别设备可以对连续多个视频段的手势识别结果(即一个手势识别结果的序列)进行融合,将多个视频段的手势识别结果所反映出的整体手势运动趋势作为融合后的手势识别结果。
具体的，定义在实时连续视频流中，在N*T1时间（T1为一个视频段的时间长度）里，手势识别设备计算了N次动作识别阶段性结果，利用这N次识别阶段性结果的融合决策（经过N*T1时间）而给出最终一个识别结果。根据用户做出一个手势动作的平均时间长度，这里N*T1可以取值为1秒左右。进行N次识别阶段性结果的融合有多种实现方式，例如下面两种方式：
(1)线性组合:
Result=α1r12r2+……+αNrN  ⑻
这里,r1,r2,…,rN是阶段性识别结果,每个结果之前的权重系数是α12,…,αN,这些权重系数可以是预先通过机器学习算法确定的系数,不同的系数组合将产生不同的组合方式。
(2)支持向量机方法SVM:
请参考图5,其示出了本申请实施例涉及的一种通过识别结果融合示意图。如图5所示,在本申请实施例中,可以把阶段性识别结果r1,r2,…,rN输入到预先训练好的机器学习模型,即图5所示的SVM模块(SVM模块中的SVM核函数是预先设置或者训练出的核函数)中,输出融合结果。
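Both strategies for fusing the N per-segment results can be sketched as follows; the weight vector and the optional SVM model are assumptions of the sketch:

    import numpy as np

    def fuse_segments(results, weights=None, svm_model=None):
        # results: the N per-segment recognition results r_1 ... r_N, oldest first.
        r = np.asarray(results, dtype=np.float32)
        if svm_model is not None:
            # Strategy (2): hand the whole result sequence to a pre-trained SVM.
            return svm_model.predict(r.reshape(1, -1))[0]
        # Strategy (1): linear combination a_1*r_1 + ... + a_N*r_N, as in formula (8).
        if weights is None:
            weights = np.full(len(r), 1.0 / len(r))   # plain average as a default choice
        return np.dot(weights, r)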
在获得融合后的手势识别结果后,手势识别设备可以根据手势识别结果调用相应的模块(例如:幻灯片演示、图片全屏播放等)达到人机互动的目的。
在本申请实施例中,如果判断视频段中没有动作发生,则手势识别设备可不对该视频段进行手势识别,以减少手势识别的频率,避免不必要的识别过程。具体的,手势识别设备可以将该视频段内的手势识别结果直接设置为空,或者,不设置该视频段的手势识别结果。
请参考图6,其示出了本申请实施例涉及的一种手势识别的流程示意图。以该流程用于图1所示的手势识别系统为例,如图6所示,图像采集设备将采集到的视频流输入手势识别设备,手势识别设备提取到视频流中的一幅图像后,通过上述步骤302所示的方法,根据该幅图像以及该幅图像之前一段时间内的至少一幅图像,判断该幅图像相对于之前的至少一幅图像是否有动作发生,若判断出有动作发生,则手势识别设备针对视频流中当前图像所在的一个视频段内的各幅图像(或者,也可以针对其中部分图像),按照步骤304和步骤305的方法分别提取该视频段的光流信息图像和彩色信息图像,并按照步骤306所示的方法对光流信息图像和彩色信息图像分别进行手势识别,再按照步骤307所示的方法将对光流信息图像和彩色信息图像分别进行手势识别获得的手势识别结果进行融合,获得该视频段对应的阶段性手势识别结果。当连续获得N个阶段性手势识别结果后,手势识别设备通过步骤308所示的方法对该N个手势识别结果进行融合,获得一个融合后的手势识别结果。
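Putting the previous sketches together, one segment of the stream could be processed as follows; every helper used here is one of the illustrative sketches above, and the `state` dictionary (holding the two stream models, N and the buffered results) is likewise an assumption:

    import cv2

    def process_segment(segment_frames, state):
        # Step 302: motion test on the last image of the window versus an earlier reference image.
        first_gray = cv2.cvtColor(segment_frames[0], cv2.COLOR_BGR2GRAY)
        last_gray = cv2.cvtColor(segment_frames[-1], cv2.COLOR_BGR2GRAY)
        if not has_motion(last_gray, first_gray):
            return None                                # "no action": skip recognition

        # Steps 304-307: per-segment two-stream recognition and fusion.
        flow = flow_image_lagrangian(segment_frames)   # or flow_image_eulerian(...)
        colors = color_images_random(segment_frames, m=1)
        r1, r2 = recognize_segment([flow], colors,
                                   state["temporal_net"], state["spatial_net"], x=1)
        state["results"].append(fuse_streams(r1, r2))

        # Step 308: once N per-segment results exist, fuse them into the final gesture.
        if len(state["results"]) >= state["N"]:
            return fuse_segments(state["results"][-state["N"]:])
        return None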
可选的,上述涉及到的机器学习模型(包括上述第一机器学习模型、第二机器学习模型以及步骤306中的双通道深度学习模型等),可以通过预先标注好对应的手势的视频样本进行机器训练来获得。
上述机器训练的过程可以由模型训练设备来实现,具体比如,以上述第一机器学习模型、第二机器学习模型以及双通道深度学习模型都通过机器训练获得为例,在一种可能的实现方式中,开发人员可以向模型训练设备中输入若干个视频流样本,每个视频流样本中包含一个手势,且开发人员预先标注好每个视频流样本中的手势,并且,开发人员将每个视频流划分为多个视频段,并标注每个视频段对应的阶段性手势。在进行机器训练时,模型训练设备按照步骤304和步骤305所示的方案,对每个视频段提取光流信息图像和彩色 信息图像,并将视频段的光流信息图像和彩色信息图像输入双通道深度学习模型,将双通道深度学习模型输出的两个识别结果,以及该视频段已标注的阶段性手势,输入第二机器学习模型,以对双通道深度学习模型和第二机器学习模型进行模型训练。此外,对于每一个视频流样本,模型训练设备将该视频流样本中的各个视频段的阶段性手势以及预先标注的该视频流样本的手势输入第一机器学习模型进行机器训练,以获得该第一机器学习模型。
再比如,以上述第一机器学习模型、第二机器学习模型以及双通道深度学习模型都通过机器训练获得为例,在另一种可能的实现方式中,开发人员可以向模型训练设备中输入若干个视频流样本,每个视频流样本中包含一个手势,且开发人员预先标注好每个视频流样本中的手势,模型训练设备将视频流划分为多个视频段,并对每个视频段提取光流信息图像和彩色信息图像,并将视频段的光流信息图像和彩色信息图像输入双通道深度学习模型,将双通道深度学习模型输出的两个识别结果输入第二机器学习模型,再将第二机器学习模型输出的,对多个视频段的阶段性手势识别结果输入第一机器学习模型,同时,模型训练设备将该视频流对应的已标注的手势输入该第一机器学习模型,以同时对第一机器学习模型、第二机器学习模型以及双通道深度学习模型进行机器训练。
需要说明的是,本发明实施例所示的方法以上述双通道深度学习模型为例进行说明,在实际应用中,手势识别设备在对每一个视频段进行识别时,可以通过其它深度学习算法识别单个视频段的手势,比如,手势识别设备可以只通过光流信息图像识别视频段对应的手势识别结果,或者,手势识别设备也可以只通过彩色信息图像识别视频段对应的手势识别结果,对于上述用于识别视频段的手势识别结果的深度学习算法,本发明实施例不做限定。
综上所述，本申请实施例所示的方法，针对视频流中的每一个视频段，对该视频段分别提取光流信息图像和彩色信息图像，并通过深度学习算法对光流信息图像和彩色信息图像分别进行手势识别，在手势识别之后，再对两种图像分别对应的手势识别结果进行融合，以确定该视频段对应的手势识别结果，最后将包含该视频段在内的连续N个视频段的手势识别结果进行融合，获得对该连续N个视频段的手势识别结果，即在上述方法中，将一个完整的手势动作划分为多个阶段动作，通过深度学习算法来识别每一个阶段动作，最后将识别出的各个阶段动作融合为完整的手势动作，在识别过程中，不需要对视频流中的手势进行分割和跟踪，而是通过计算速度较快的深度学习算法来识别各个阶段动作，从而达到提高手势识别的速度，降低手势识别的延迟的效果。
图7是本申请一个示例性实施例提供的手势识别设备70的结构示意图,该手势识别设备70可以实现为图1所示的系统中的手势识别设备120。如图7所示,该手势识别设备70可以包括:处理器71以及存储器73。
处理器71可以包括一个或者一个以上处理单元,该处理单元可以是中央处理单元(英文:central processing unit,CPU)或者网络处理器(英文:network processor,NP)等。
可选的,该手势识别设备70还可以包括存储器73。存储器73可用于存储软件程序,该软件程序可以由处理器71执行。此外,该存储器73中还可以存储各类业务数据或者用户数据。该软件程序可以包括图像获取模块、识别模块以及融合模块;可选的,该软件程序还可以包括时间窗确定模块以及判断模块;
其中,图像获取模块由处理器71执行,以实现上述图3所示实施例中有关获取视频流的第一视频段中提取出的M幅图像的功能。
识别模块由处理器71执行,以实现上述图3所示实施例中有关识别第一视频段对应的手势识别结果的功能。
融合模块由处理器71执行,以实现上述图3所示实施例中有关对连续N个视频段的手势识别结果进行融合的功能。
时间窗确定模块由处理器71执行,以实现上述图3所示实施例中有关确定时间窗的功能。
判断模块由处理器71执行,以实现上述图3所示实施例中有关判断第一视频段中是否有动作发生的功能。
可选的,该手势识别设备70还可以包括通信接口74,该通信接口74可以包括网络接口。其中,该网络接口用于连接图像采集设备。具体的,该网络接口可以包括有线网络接口,比如以太网接口或者光纤接口,或者,网络接口也可以包括无线网络接口,比如无线局域网接口或者蜂窝移动网络接口。手势识别设备70通过该网络接口74与其它设备进行通信。
可选的,处理器71可以用总线与存储器73和通信接口74相连。
可选地，该手势识别设备70还可以包括输出设备75以及输入设备77。输出设备75和输入设备77与处理器71相连。输出设备75可以是用于显示信息的显示器、播放声音的功放设备或者打印机等，输出设备75还可以包括输出控制器，用以提供输出到显示屏、功放设备或者打印机。输入设备77可以是用于用户输入信息的诸如鼠标、键盘、电子触控笔或者触控面板之类的设备，输入设备77还可以包括输入控制器以用于接收和处理来自鼠标、键盘、电子触控笔或者触控面板等设备的输入。
下述为本申请的装置实施例,可以用于执行本申请的方法实施例。对于本申请的装置实施例中未披露的细节,请参照本申请的方法实施例。
图8是本申请一个示例性实施例提供的一种手势识别装置的结构方框图，该手势识别装置可以通过硬件电路或者软件硬件的结合实现成为手势识别设备的部分或者全部，该手势识别设备可以是上述图1所示的实施例中的手势识别设备120。该手势识别装置可以包括：图像获取单元801、识别单元802以及融合单元803；可选的，该手势识别装置还可以包括时间窗确定单元804以及判断单元805。
其中,图像获取单元801由处理器执行,以实现上述图3所示实施例中有关获取视频流的第一视频段中提取出的M幅图像的功能。
识别单元802由处理器执行,以实现上述图3所示实施例中有关获得第一视频段对应手势识别结果的功能。
融合单元803由处理器执行,以实现上述图3所示实施例中有关对连续N个视频段的手势识别结果进行融合的功能。
时间窗确定单元804由处理器执行,以实现上述图3所示实施例中有关确定时间窗的功能。
判断单元805由处理器执行，以实现上述图3所示实施例中有关判断第一视频段中是否有动作发生的功能。
需要说明的是:上述实施例提供的手势识别装置在进行手势识别时,仅以上述各功能单元的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能单元完成,即将设备的内部结构划分成不同的功能单元,以完成以上描述的全部或者部分功能。另外,上述实施例提供的手势识别装置与手势识别方法的方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
上述本申请的实施例序号仅仅为了描述,不代表实施例的优劣。
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,所述的程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。
以上所述仅为本申请的可选实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (30)

  1. 一种手势识别设备,其特征在于,所述设备包括:处理器和存储器;
    所述处理器,用于获取M幅图像,所述M幅图像是从视频流中的第一视频段中提取出的,其中,所述第一视频段是所述视频流中任意一个视频段,M为大于或等于2的整数;
    所述处理器,用于通过深度学习算法对所述M幅图像进行手势识别,获得所述第一视频段对应的手势识别结果;
    所述处理器,用于在获得所述视频流中包含所述第一视频段在内的连续N个视频段的手势识别结果后,对所述连续N个视频段的手势识别结果进行结果融合,获得融合后的手势识别结果,N为大于或等于2的整数。
  2. 根据权利要求1所述的设备,其特征在于,在对所述连续N个视频段的手势识别结果进行结果融合,获得融合后的手势识别结果时,所述处理器,具体用于:
    将所述连续N个视频段的手势识别结果输入预先训练的第一机器学习模型,获得所述融合后的手势识别结果,所述第一机器学习模型用于确定输入的连续N个手势识别结果所构成的整体手势运动趋势,并将所述整体手势运动趋势对应的手势输出为所述融合后的手势识别结果。
  3. 根据权利要求2所述的设备,其特征在于,
    所述第一机器学习模型为神经网络模型,且所述神经网络模型的神经元数量为N;
    或者,
    所述第一机器学习模型为支持向量机SVM模型。
  4. 根据权利要求1所述的设备,其特征在于,在对所述连续N个视频段的手势识别结果进行结果融合,获得融合后的手势识别结果时,所述处理器,具体用于:
    获取预先设置的,所述连续N个视频段的手势识别结果各自对应的权重系数;
    根据所述连续N个视频段的手势识别结果各自对应的权重系数,对所述连续N个视频段的手势识别结果进行加权平均,获得所述融合后的手势识别结果。
  5. 根据权利要求1所述的设备,其特征在于,在通过深度学习算法对所述M幅图像进行手势识别,获得所述第一视频段对应的手势识别结果时,所述处理器,具体用于:
    对所述M幅图像进行图像处理,获得所述第一视频段对应的光流信息图像,所述光流信息图像包含所述M幅图像中的第一图像与所述第一图像之前的第p幅图像之间的光流信息,所述第一图像是所述M幅图像中的任意一幅,所述光流信息包含图像中的像素点的瞬时速度矢量信息,并通过第一深度学习算法对所述光流信息图像进行手势识别,获得第一识别结果,p为大于或等于1的整数;
    对所述M幅图像进行图像处理,获得所述第一视频段对应的彩色信息图像,所述彩色信息图像包含所述M幅图像的彩色信息,所述彩色信息包含图像中的各个像素点的色值,并通过第二深度学习算法对所述彩色信息图像进行手势识别,获得第二识别结果;
    对所述第一识别结果和所述第二识别结果进行融合,获得所述第一视频段的手势识别结果。
  6. 根据权利要求5所述的设备,其特征在于,在对所述M幅图像进行图像处理,获得所述第一视频段对应的光流信息图像时,所述处理器,具体用于:
    对于所述第一图像,按预设规则获取所述视频流中处于所述第一图像之前的第p幅图像;计算所述第一图像与所述第p幅图像之间的光流信息,并生成包含所述第一图像与所述第p幅图像之间的光流信息的光流信息图像;其中,所述第一图像与所述第p幅图像之间的时间间隔不小于所述第一深度学习算法的前向计算时间以及计算所述光流信息图像所需的时间;
    或者,
    对于所述第一图像,按预设规则获取所述视频流中处于所述第一图像之前的全部p幅图像;计算所述第一图像以及所述M幅图像中每相邻两幅图像之间的光流信息,将所述每相邻两幅图像之间的光流信息进行累加后,生成包含累加后的光流信息的光流信息图像;其中,所述第一图像与所述第一图像之前的第p幅图像之间的时间间隔不小于所述第一深度学习算法的前向计算时间以及计算所述光流信息图像所需的时间。
  7. 根据权利要求5所述的设备,其特征在于,在对所述M幅图像进行图像处理,获得所述第一视频段对应的彩色信息图像时,所述处理器,具体用于:
    提取所述M幅图像中的m幅图像的彩色信息，根据提取到的彩色信息生成所述m幅图像各自对应的彩色信息图像，将所述m幅图像各自对应的彩色信息图像获取为所述第一视频段对应的彩色信息图像；所述m幅图像是所述M幅图像中随机的m幅图像，或者，所述m幅图像是所述M幅图像中，相对于各自在视频流中的前一幅图像变化最大的m幅图像，m为大于或等于1的整数；
    或者,检测所述M幅图像中图像内容随时间变化的像素位置,计算所述M幅图像中对应识别出的像素位置处的彩色信息的平均值,获得所述识别出的像素位置处的新的彩色信息,根据所述识别出的像素位置处的新的彩色信息生成所述第一视频段对应的彩色信息图像。
  8. 根据权利要求1至7任一所述的设备,其特征在于,在获取M幅图像之前,所述处理器,还用于:
    确定所述视频流中的一个预设时间长度的时间窗,所述时间窗的结束时刻处于所述第一视频段对应的时间段内;
    根据所述时间窗内的最后一幅图像以及至少一幅参考图像,判断所述第一视频段中是否有动作发生,所述参考图像是所述时间窗内除了所述最后一幅图像之外的其它任意一幅图像;
    若判断结果为所述第一视频段中有动作发生,则执行所述获取M幅图像的步骤。
  9. 根据权利要求8所述的设备,其特征在于,在根据所述时间窗内的最后一幅图像以及至少一幅参考图像,判断所述第一视频段中是否有动作发生时,所述处理器,具体用于:
    针对所述至少一幅参考图像中的每一幅参考图像，计算所述最后一幅图像的偏导图像，所述偏导图像中的每个像素的值，是所述最后一幅图像中对应像素的值相对于所述参考图像中对应像素的值的偏导；
    对所述偏导图像中的各个像素的值进行归一化处理,获得归一化之后的偏导图像;
    根据预设的二值化阈值,对所述归一化之后的偏导图像进行二值化处理,获得所述偏导图像的二值化图像,所述二值化图像中的各个像素的值为0或1;
    计算所述二值化图像中各个像素的灰度值之和;
    当所述灰度值之和大于0时,确定所述第一视频段中有动作发生。
  10. 根据权利要求5至7任一所述的设备，其特征在于，所述处理器，在对所述第一识别结果和所述第二识别结果进行融合，获得所述第一视频段的手势识别结果时，具体用于：
    对所述第一识别结果和所述第二识别结果进行平均值计算,根据所述平均值计算的计算结果获得所述第一视频段的手势识别结果;
    或者,
    将所述第一识别结果和所述第二识别结果输入预先训练的第二机器学习模型,以获得所述第一视频段的手势识别结果。
  11. 一种手势识别装置,其特征在于,所述装置包括:
    图像获取单元,用于获取M幅图像,所述M幅图像是从视频流中的第一视频段中提取出的,其中,所述第一视频段是所述视频流中任意一个视频段,M为大于或等于2的整数;
    识别单元,用于通过深度学习算法对所述M幅图像进行手势识别,获得所述第一视频段对应的手势识别结果;
    融合单元，用于在获得所述视频流中包含所述第一视频段在内的连续N个视频段的手势识别结果后，对所述连续N个视频段的手势识别结果进行结果融合，获得融合后的手势识别结果；N≥2，且N为整数。
  12. 根据权利要求11所述的装置,其特征在于,在对所述连续N个视频段的手势识别结果进行结果融合,获得融合后的手势识别结果时,所述融合单元,具体用于:
    将所述连续N个视频段的手势识别结果输入预先训练的第一机器学习模型,获得所述融合后的手势识别结果,所述第一机器学习模型用于确定输入的连续N个手势识别结果所构成的整体手势运动趋势,并将所述整体手势运动趋势对应的手势输出为所述融合后的手势识别结果。
  13. 根据权利要求12所述的装置,其特征在于,
    所述第一机器学习模型为神经网络模型,且所述神经网络模型的神经元数量为N;
    或者,
    所述第一机器学习模型为支持向量机SVM模型。
  14. 根据权利要求11所述的装置,其特征在于,在对所述连续N个视频段的手势识别结果进行结果融合,获得融合后的手势识别结果时,所述融合单元,具体用于:
    获取预先设置的,所述连续N个视频段的手势识别结果各自对应的权重系数;
    根据所述连续N个视频段的手势识别结果各自对应的权重系数,对所述连续N个视频段的手势识别结果进行加权平均,获得所述融合后的手势识别结果。
  15. 根据权利要求11所述的装置,其特征在于,所述识别单元,具体用于:
    对所述M幅图像进行图像处理,获得所述第一视频段对应的光流信息图像,所述光流信息图像包含所述M幅图像中的第一图像与所述第一图像之前的第p幅图像之间的光流信息,所述第一图像是所述M幅图像中的任意一幅,所述光流信息包含图像中的像素点的瞬时速度矢量信息,并通过第一深度学习算法对所述光流信息图像进行手势识别,获得第一识别结果,p为大于或等于1的整数;
    对所述M幅图像进行图像处理,获得所述第一视频段对应的彩色信息图像,所述彩色信息图像包含所述M幅图像的彩色信息,所述彩色信息包含图像中的各个像素点的色值,并通过第二深度学习算法对所述彩色信息图像进行手势识别,获得第二识别结果;
    对所述第一识别结果和所述第二识别结果进行融合,获得所述第一视频段的手势识别结果。
  16. 根据权利要求15所述的装置,其特征在于,在对所述M幅图像进行图像处理,获得所述第一视频段对应的光流信息图像时,所述识别单元,具体用于:
    对于所述第一图像,按预设规则获取所述视频流中处于所述第一图像之前的第p幅图像;计算所述第一图像与所述第p幅图像之间的光流信息,并生成包含所述第一图像与所述第p幅图像之间的光流信息的光流信息图像;其中,所述第一图像与所述第p幅图像之间的时间间隔不小于所述第一深度学习算法的前向计算时间以及计算所述光流信息图像所需的时间;
    或者,
    对于所述第一图像,按预设规则获取所述视频流中处于所述第一图像之前的全部p幅图像;计算所述第一图像以及所述M幅图像中每相邻两幅图像之间的光流信息,将所述每相邻两幅图像之间的光流信息进行累加后,生成包含累加后的光流信息的光流信息图像;其中,所述第一图像与所述第一图像之前的第p幅图像之间的时间间隔不小于所述第一深度学习算法的前向计算时间以及计算所述光流信息图像所需的时间。
  17. 根据权利要求15所述的装置,其特征在于,在对所述M幅图像进行图像处理,获得所述第一视频段对应的彩色信息图像时,所述识别单元,具体用于:
    提取所述M幅图像中的m幅图像的彩色信息,根据提取到的彩色信息生成所述m幅图像各自对应的彩色信息图像,将所述m幅图像各自对应的彩色信息图像获取为所述第一视频段对应的彩色信息图像;所述m幅图像是所述M幅图像中随机的m幅图像,或者,所述m幅图像是所述M幅图像中,相对于各自在视频流中的前一幅图像变化最大的m幅图像,m为大于或等于1的整数;
    或者,检测所述M幅图像中图像内容随时间变化的像素位置,计算所述M幅图像中对应识别出的像素位置处的彩色信息的平均值,获得所述识别出的像素位置处的新的彩色信息,根据所述识别出的像素位置处的新的彩色信息生成所述第一视频段对应的彩色信息图像。
  18. 根据权利要求11至17任一所述的装置,其特征在于,所述装置还包括:
    时间窗确定单元,用于在所述图像获取单元获取M幅图像之前,确定所述视频流中的一个预设时间长度的时间窗,所述时间窗的结束时刻处于所述第一视频段对应的时间段内;
    判断单元,用于根据所述时间窗内的最后一幅图像以及至少一幅参考图像,判断所述第一视频段中是否有动作发生,所述至少一幅参考图像是所述时间窗内除了所述最后一幅图像之外的其它任意一幅图像;
    所述图像获取单元,用于在判断结果为所述第一视频段中有动作发生时,执行所述获取M幅图像的步骤。
  19. 根据权利要求18所述的装置,其特征在于,所述判断单元,具体用于:
    针对所述至少一幅参考图像中的每一幅参考图像,计算所述最后一幅图像的偏导图像,所述偏导图像中的每个像素的值,是所述最后一幅图像中对应像素的值相对于所述参考图像中对应像素的值的偏导;
    对所述偏导图像中的各个像素的值进行归一化处理,获得归一化之后的偏导图像;
    根据预设的二值化阈值,对所述归一化之后的偏导图像进行二值化处理,获得所述偏导图像的二值化图像,所述二值化图像中的各个像素的值为0或1;
    计算所述二值化图像中各个像素的灰度值之和;
    当所述灰度值之和大于0时,确定所述第一视频段中有动作发生。
  20. 根据权利要求15至17任一所述的装置,其特征在于,在对所述第一识别结果和所述第二识别结果进行融合,获得所述第一视频段的手势识别结果时,所述识别单元,具体用于:
    对所述第一识别结果和所述第二识别结果进行平均值计算,根据所述平均值计算的计算结果获得所述第一视频段的手势识别结果;
    或者,
    将所述第一识别结果和所述第二识别结果输入预先训练的第二机器学习模型,以获得所述第一视频段的手势识别结果。
  21. 一种手势识别方法,其特征在于,所述方法包括:
    获取M幅图像,所述M幅图像是从视频流中的第一视频段中提取出的,其中,所述第一视频段是所述视频流中任意一个视频段,M为大于或等于2的整数;
    通过深度学习算法对所述M幅图像进行手势识别,获得所述第一视频段对应的手势识别结果;
    在获得所述视频流中包含所述第一视频段在内的连续N个视频段的手势识别结果后,对所述连续N个视频段的手势识别结果进行结果融合,获得融合后的手势识别结果;N≥2,且N为整数。
  22. 根据权利要求21所述的方法,其特征在于,所述对所述连续N个视频段的手势识别结果进行结果融合,获得融合后的手势识别结果,包括:
    将所述连续N个视频段的手势识别结果输入预先训练的第一机器学习模型,获得所述融合后的手势识别结果,所述第一机器学习模型用于确定输入的连续N个手势识别结果所构成的整体手势运动趋势,并将所述整体手势运动趋势对应的手势输出为所述融合后的手势识别结果。
  23. 根据权利要求22所述的方法,其特征在于,
    所述第一机器学习模型为神经网络模型,且所述神经网络模型的神经元数量为N;
    或者,
    所述第一机器学习模型为支持向量机SVM模型。
  24. 根据权利要求21所述的方法,其特征在于,所述对所述连续N个视频段的手势识别结果进行结果融合,获得融合后的手势识别结果,包括:
    获取预先设置的,所述连续N个视频段的手势识别结果各自对应的权重系数;
    根据所述连续N个视频段的手势识别结果各自对应的权重系数,对所述连续N个视频段的手势识别结果进行加权平均,获得所述融合后的手势识别结果。
  25. 根据权利要求21所述的方法，其特征在于，所述通过深度学习算法对所述M幅图像进行手势识别，获得所述第一视频段对应的手势识别结果，包括：
    对所述M幅图像进行图像处理,获得所述第一视频段对应的光流信息图像,所述光流信息图像包含所述M幅图像中的第一图像与所述第一图像之前的第p幅图像之间的光流信息,所述第一图像是所述M幅图像中的任意一幅,所述光流信息包含图像中的像素点的瞬时速度矢量信息,并通过第一深度学习算法对所述光流信息图像进行手势识别,获得第一识别结果,p为大于或等于1的整数;
    对所述M幅图像进行图像处理,获得所述第一视频段对应的彩色信息图像,所述彩色信息图像包含所述M幅图像的彩色信息,所述彩色信息包含图像中的各个像素点的色值,并通过第二深度学习算法对所述彩色信息图像进行手势识别,获得第二识别结果;
    对所述第一识别结果和所述第二识别结果进行融合,获得所述第一视频段的手势识别结果。
  26. 根据权利要求25所述的方法,其特征在于,所述对所述M幅图像进行图像处理,获得所述第一视频段对应的光流信息图像,包括:
    对于所述第一图像,按预设规则获取所述视频流中处于所述第一图像之前的第p幅图像;计算所述第一图像与所述第p幅图像之间的光流信息,并生成包含所述第一图像与所述第p幅图像之间的光流信息的光流信息图像;其中,所述第一图像与所述第p幅图像之间的时间间隔不小于所述第一深度学习算法的前向计算时间以及计算所述光流信息图像所需的时间;
    或者,
    对于所述第一图像,按预设规则获取所述视频流中处于所述第一图像之前的全部p幅图像;计算所述第一图像以及所述M幅图像中每相邻两幅图像之间的光流信息,将所述每相邻两幅图像之间的光流信息进行累加后,生成包含累加后的光流信息的光流信息图像;其中, 所述第一图像与所述第一图像之前的第p幅图像之间的时间间隔不小于所述第一深度学习算法的前向计算时间以及计算所述光流信息图像所需的时间。
  27. 根据权利要求25所述的方法,其特征在于,所述对所述M幅图像进行图像处理,获得所述第一视频段对应的彩色信息图像,包括:
    提取所述M幅图像中的m幅图像的彩色信息,根据提取到的彩色信息生成所述m幅图像各自对应的彩色信息图像,将所述m幅图像各自对应的彩色信息图像获取为所述第一视频段对应的彩色信息图像;所述m幅图像是所述M幅图像中随机的m幅图像,或者,所述m幅图像是所述M幅图像中,相对于各自在视频流中的前一幅图像变化最大的m幅图像,m为大于或等于1的整数;
    或者,检测所述M幅图像中图像内容随时间变化的像素位置,计算所述M幅图像中对应识别出的像素位置处的彩色信息的平均值,获得所述识别出的像素位置处的新的彩色信息,根据所述识别出的像素位置处的新的彩色信息生成所述第一视频段对应的彩色信息图像。
  28. 根据权利要求21至27任一所述的方法,其特征在于,所述获取M幅图像之前,所述方法还包括:
    确定所述视频流中的一个预设时间长度的时间窗,所述时间窗的结束时刻处于所述第一视频段对应的时间段内;
    根据所述时间窗内的最后一幅图像以及至少一幅参考图像,判断所述第一视频段中是否有动作发生,所述至少一幅参考图像是所述时间窗内除了所述最后一幅图像之外的其它任意一幅图像;
    若判断结果为所述第一视频段中有动作发生,则执行所述获取M幅图像的步骤。
  29. 根据权利要求28所述的方法,其特征在于,所述根据所述时间窗内的最后一幅图像以及所述至少一幅参考图像,判断所述第一视频段中是否有动作发生,包括:
    针对所述至少一幅参考图像中的每一幅参考图像,计算所述最后一幅图像的偏导图像,所述偏导图像中的每个像素的值,是所述最后一幅图像中对应像素的值相对于所述参考图像中对应像素的值的偏导;
    对所述偏导图像中的各个像素的值进行归一化处理,获得归一化之后的偏导图像;
    根据预设的二值化阈值,对所述归一化之后的偏导图像进行二值化处理,获得所述偏导图像的二值化图像,所述二值化图像中的各个像素的值为0或1;
    计算所述二值化图像中各个像素的灰度值之和;
    当所述灰度值之和大于0时,确定所述第一视频段中有动作发生。
  30. 根据权利要求25至27任一所述的方法,其特征在于,所述对所述第一识别结果和所述第二识别结果进行融合,获得所述第一视频段的手势识别结果,包括:
    对所述第一识别结果和所述第二识别结果进行平均值计算,根据所述平均值计算的计算结果获得所述第一视频段的手势识别结果;
    或者,
    将所述第一识别结果和所述第二识别结果输入预先训练的第二机器学习模型,以获得所述第一视频段的手势识别结果。
PCT/CN2017/095388 2017-08-01 2017-08-01 一种手势识别方法、装置及设备 WO2019023921A1 (zh)

Priority Applications (6)

Application Number Priority Date Filing Date Title
PCT/CN2017/095388 WO2019023921A1 (zh) 2017-08-01 2017-08-01 一种手势识别方法、装置及设备
EP17920578.6A EP3651055A4 (en) 2017-08-01 2017-08-01 METHOD, APPARATUS AND DEVICE FOR GESTURE RECOGNITION
KR1020207005925A KR102364993B1 (ko) 2017-08-01 2017-08-01 제스처 인식 방법, 장치 및 디바이스
CN201780093539.8A CN110959160A (zh) 2017-08-01 2017-08-01 一种手势识别方法、装置及设备
BR112020001729A BR112020001729A8 (pt) 2017-08-01 2017-08-01 Método, aparelho e dispositivo de reconhecimento de gestos
US16/776,282 US11450146B2 (en) 2017-08-01 2020-01-29 Gesture recognition method, apparatus, and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/095388 WO2019023921A1 (zh) 2017-08-01 2017-08-01 一种手势识别方法、装置及设备

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/776,282 Continuation US11450146B2 (en) 2017-08-01 2020-01-29 Gesture recognition method, apparatus, and device

Publications (1)

Publication Number Publication Date
WO2019023921A1 true WO2019023921A1 (zh) 2019-02-07

Family

ID=65232224

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/095388 WO2019023921A1 (zh) 2017-08-01 2017-08-01 一种手势识别方法、装置及设备

Country Status (6)

Country Link
US (1) US11450146B2 (zh)
EP (1) EP3651055A4 (zh)
KR (1) KR102364993B1 (zh)
CN (1) CN110959160A (zh)
BR (1) BR112020001729A8 (zh)
WO (1) WO2019023921A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110458015A (zh) * 2019-07-05 2019-11-15 平安科技(深圳)有限公司 基于图像识别的防自杀预警方法、装置、设备及存储介质
CN110728209A (zh) * 2019-09-24 2020-01-24 腾讯科技(深圳)有限公司 一种姿态识别方法、装置、电子设备及存储介质
CN111368770A (zh) * 2020-03-11 2020-07-03 桂林理工大学 基于骨骼点检测与跟踪的手势识别方法
WO2021184356A1 (en) * 2020-03-20 2021-09-23 Huawei Technologies Co., Ltd. Methods and systems for hand gesture-based control of a device
CN114564104A (zh) * 2022-02-17 2022-05-31 西安电子科技大学 一种基于视频中动态手势控制的会议演示系统
US11966516B2 (en) 2022-05-30 2024-04-23 Huawei Technologies Co., Ltd. Methods and systems for hand gesture-based control of a device

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018176000A1 (en) 2017-03-23 2018-09-27 DeepScale, Inc. Data synthesis for autonomous control systems
US11893393B2 (en) 2017-07-24 2024-02-06 Tesla, Inc. Computational array microprocessor system with hardware arbiter managing memory requests
US11157441B2 (en) 2017-07-24 2021-10-26 Tesla, Inc. Computational array microprocessor system using non-consecutive data formatting
US10671349B2 (en) 2017-07-24 2020-06-02 Tesla, Inc. Accelerated mathematical engine
US11409692B2 (en) 2017-07-24 2022-08-09 Tesla, Inc. Vector computational unit
US11561791B2 (en) 2018-02-01 2023-01-24 Tesla, Inc. Vector computational unit receiving data elements in parallel from a last row of a computational array
US11215999B2 (en) 2018-06-20 2022-01-04 Tesla, Inc. Data pipeline and deep learning system for autonomous driving
US11361457B2 (en) 2018-07-20 2022-06-14 Tesla, Inc. Annotation cross-labeling for autonomous control systems
US11636333B2 (en) 2018-07-26 2023-04-25 Tesla, Inc. Optimizing neural network structures for embedded systems
US11562231B2 (en) 2018-09-03 2023-01-24 Tesla, Inc. Neural networks for embedded devices
EP3864573A1 (en) 2018-10-11 2021-08-18 Tesla, Inc. Systems and methods for training machine models with augmented data
US11196678B2 (en) 2018-10-25 2021-12-07 Tesla, Inc. QOS manager for system on a chip communications
US11816585B2 (en) 2018-12-03 2023-11-14 Tesla, Inc. Machine learning models operating at different frequencies for autonomous vehicles
US11537811B2 (en) 2018-12-04 2022-12-27 Tesla, Inc. Enhanced object detection for autonomous vehicles based on field view
US11610117B2 (en) 2018-12-27 2023-03-21 Tesla, Inc. System and method for adapting a neural network model on a hardware platform
US10997461B2 (en) 2019-02-01 2021-05-04 Tesla, Inc. Generating ground truth for machine learning from time series elements
US11567514B2 (en) 2019-02-11 2023-01-31 Tesla, Inc. Autonomous and user controlled vehicle summon to a target
US10956755B2 (en) 2019-02-19 2021-03-23 Tesla, Inc. Estimating object properties using visual image data
WO2020251385A1 (en) * 2019-06-14 2020-12-17 Ringcentral, Inc., (A Delaware Corporation) System and method for capturing presentation gestures
CN112115801B (zh) * 2020-08-25 2023-11-24 深圳市优必选科技股份有限公司 动态手势识别方法、装置、存储介质及终端设备
US11481039B2 (en) * 2020-08-28 2022-10-25 Electronics And Telecommunications Research Institute System for recognizing user hand gesture and providing virtual reality content based on deep learning using transfer learning
US20220129667A1 (en) * 2020-10-26 2022-04-28 The Boeing Company Human Gesture Recognition for Autonomous Aircraft Operation
US20220292285A1 (en) * 2021-03-11 2022-09-15 International Business Machines Corporation Adaptive selection of data modalities for efficient video recognition
EP4320554A1 (en) * 2021-04-09 2024-02-14 Google LLC Using a machine-learned module for radar-based gesture detection in an ambient computer environment
CN115809006B (zh) * 2022-12-05 2023-08-08 北京拙河科技有限公司 一种画面控制人工指令的方法及装置

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120056846A1 (en) * 2010-03-01 2012-03-08 Lester F. Ludwig Touch-based user interfaces employing artificial neural networks for hdtp parameter and symbol derivation
CN104182772A (zh) * 2014-08-19 2014-12-03 大连理工大学 一种基于深度学习的手势识别方法
CN106295531A (zh) * 2016-08-01 2017-01-04 乐视控股(北京)有限公司 一种手势识别方法和装置以及虚拟现实终端
CN106991372A (zh) * 2017-03-02 2017-07-28 北京工业大学 一种基于混合深度学习模型的动态手势识别方法

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5149033B2 (ja) * 2008-02-26 2013-02-20 岐阜車体工業株式会社 動作解析方法及び動作解析装置並びにその動作解析装置を利用した動作評価装置
CN102395984A (zh) * 2009-04-14 2012-03-28 皇家飞利浦电子股份有限公司 用于视频内容分析的关键帧提取
JP5604256B2 (ja) * 2010-10-19 2014-10-08 日本放送協会 人物動作検出装置およびそのプログラム
CN102155933B (zh) * 2011-03-08 2013-04-24 西安工程大学 一种基于视频差异分析的输电线路导线舞动测量方法
CN102854983B (zh) 2012-09-10 2015-12-02 中国电子科技集团公司第二十八研究所 一种基于手势识别的人机交互方法
US9829984B2 (en) * 2013-05-23 2017-11-28 Fastvdo Llc Motion-assisted visual language for human computer interfaces
CN103514608B (zh) * 2013-06-24 2016-12-28 西安理工大学 基于运动注意力融合模型的运动目标检测与提取方法
KR102214922B1 (ko) * 2014-01-23 2021-02-15 삼성전자주식회사 행동 인식을 위한 특징 벡터 생성 방법, 히스토그램 생성 방법, 및 분류기 학습 방법
CN103984937A (zh) * 2014-05-30 2014-08-13 无锡慧眼电子科技有限公司 基于光流法的行人计数方法
US20160092726A1 (en) * 2014-09-30 2016-03-31 Xerox Corporation Using gestures to train hand detection in ego-centric video
CN105550699B (zh) * 2015-12-08 2019-02-12 北京工业大学 一种基于cnn融合时空显著信息的视频识别分类方法
US10157309B2 (en) * 2016-01-14 2018-12-18 Nvidia Corporation Online detection and classification of dynamic gestures with recurrent convolutional neural networks
CN105787458B (zh) * 2016-03-11 2019-01-04 重庆邮电大学 基于人工设计特征和深度学习特征自适应融合的红外行为识别方法

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120056846A1 (en) * 2010-03-01 2012-03-08 Lester F. Ludwig Touch-based user interfaces employing artificial neural networks for hdtp parameter and symbol derivation
CN104182772A (zh) * 2014-08-19 2014-12-03 大连理工大学 一种基于深度学习的手势识别方法
CN106295531A (zh) * 2016-08-01 2017-01-04 乐视控股(北京)有限公司 一种手势识别方法和装置以及虚拟现实终端
CN106991372A (zh) * 2017-03-02 2017-07-28 北京工业大学 一种基于混合深度学习模型的动态手势识别方法

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110458015A (zh) * 2019-07-05 2019-11-15 平安科技(深圳)有限公司 基于图像识别的防自杀预警方法、装置、设备及存储介质
CN110728209A (zh) * 2019-09-24 2020-01-24 腾讯科技(深圳)有限公司 一种姿态识别方法、装置、电子设备及存储介质
CN110728209B (zh) * 2019-09-24 2023-08-08 腾讯科技(深圳)有限公司 一种姿态识别方法、装置、电子设备及存储介质
CN111368770A (zh) * 2020-03-11 2020-07-03 桂林理工大学 基于骨骼点检测与跟踪的手势识别方法
CN111368770B (zh) * 2020-03-11 2022-06-07 桂林理工大学 基于骨骼点检测与跟踪的手势识别方法
WO2021184356A1 (en) * 2020-03-20 2021-09-23 Huawei Technologies Co., Ltd. Methods and systems for hand gesture-based control of a device
CN114564104A (zh) * 2022-02-17 2022-05-31 西安电子科技大学 一种基于视频中动态手势控制的会议演示系统
US11966516B2 (en) 2022-05-30 2024-04-23 Huawei Technologies Co., Ltd. Methods and systems for hand gesture-based control of a device

Also Published As

Publication number Publication date
EP3651055A1 (en) 2020-05-13
KR102364993B1 (ko) 2022-02-17
BR112020001729A2 (pt) 2020-07-21
KR20200036002A (ko) 2020-04-06
US20200167554A1 (en) 2020-05-28
EP3651055A4 (en) 2020-10-21
BR112020001729A8 (pt) 2023-04-11
CN110959160A (zh) 2020-04-03
US11450146B2 (en) 2022-09-20

Similar Documents

Publication Publication Date Title
WO2019023921A1 (zh) 一种手势识别方法、装置及设备
US20210264133A1 (en) Face location tracking method, apparatus, and electronic device
US20190346932A1 (en) Motion-Assisted Visual Language for Human Computer Interfaces
CN109426782B (zh) 对象检测方法和用于对象检测的神经网络系统
CN107798272B (zh) 快速多目标检测与跟踪系统
US9710698B2 (en) Method, apparatus and computer program product for human-face features extraction
JP6555906B2 (ja) 情報処理装置、情報処理方法、およびプログラム
US20160300100A1 (en) Image capturing apparatus and method
CN109389086B (zh) 检测无人机影像目标的方法和系统
US9652850B2 (en) Subject tracking device and subject tracking method
CN111062263B (zh) 手部姿态估计的方法、设备、计算机设备和存储介质
KR102434397B1 (ko) 전역적 움직임 기반의 실시간 다중 객체 추적 장치 및 방법
JP2016143335A (ja) グループ対応付け装置、グループ対応付け方法及びグループ対応付け用コンピュータプログラム
KR101313879B1 (ko) 기울기 히스토그램을 이용한 사람 검출 추적 시스템 및 방법
KR20060121503A (ko) 무인 감시 로봇에서 중요 얼굴 추적 장치 및 방법
CN114613006A (zh) 一种远距离手势识别方法及装置
US20230033548A1 (en) Systems and methods for performing computer vision task using a sequence of frames
JP2012033054A (ja) 顔画像サンプル採取装置、顔画像サンプル採取方法、プログラム
KR101909326B1 (ko) 얼굴 모션 변화에 따른 삼각 매쉬 모델을 활용하는 사용자 인터페이스 제어 방법 및 시스템
US11875518B2 (en) Object feature extraction device, object feature extraction method, and non-transitory computer-readable medium
Kishore et al. DSLR-Net a depth based sign language recognition using two stream convents
US20230007167A1 (en) Image processing device and image processing system, and image processing method
TW202411949A (zh) 臉部屬性的串級偵測
CN115909497A (zh) 一种人体姿态识别方法及装置
CN117765439A (zh) 目标对象检测方法、车辆的控制方法、装置、芯片和设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 17920578; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
REG Reference to national code (Ref country code: BR; Ref legal event code: B01A; Ref document number: 112020001729; Country of ref document: BR)
ENP Entry into the national phase (Ref document number: 2017920578; Country of ref document: EP; Effective date: 20200204)
ENP Entry into the national phase (Ref document number: 20207005925; Country of ref document: KR; Kind code of ref document: A)
ENP Entry into the national phase (Ref document number: 112020001729; Country of ref document: BR; Kind code of ref document: A2; Effective date: 20200127)