WO2020057329A1 - Method, apparatus, device, and storage medium for recognizing video actions - Google Patents

Method, apparatus, device, and storage medium for recognizing video actions Download PDF

Info

Publication number
WO2020057329A1
WO2020057329A1 (international application PCT/CN2019/102717)
Authority
WO
WIPO (PCT)
Prior art keywords
video frame
action
motion
current video
frame
Prior art date
Application number
PCT/CN2019/102717
Other languages
English (en)
French (fr)
Inventor
宋丽
石峰
王璠
芦姗
Original Assignee
广州市百果园信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州市百果园信息技术有限公司
Priority to EP19862600.4A (EP3862914A4)
Priority to US17/278,195 (US20220130146A1)
Publication of WO2020057329A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/20: Analysis of motion
    • G06T7/70: Determining position or orientation of objects or cameras
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/10: Image acquisition modality
    • G06T2207/10016: Video; Image sequence
    • G06T2207/30: Subject of image; Context of image processing
    • G06T2207/30241: Trajectory
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42: Higher-level, semantic clustering, classification or understanding of video scenes of sport video content
    • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G06V40/28: Recognition of hand or arm movements, e.g. recognition of deaf sign language

Definitions

  • the embodiments of the present application relate to the technical field of action recognition, and, for example, to a method, an apparatus, a device, and a storage medium for recognizing video actions.
  • Video gesture recognition is often used in application scenarios that require strong interaction. Continuously locating and recognizing a user's gestures involves uncontrollable factors such as complex backgrounds, motion blur, and non-standard actions.
  • the gesture recognition performed on images in a video in the related art cannot guarantee the stability and smoothness of the gesture recognition result.
  • the embodiments of the present application provide a method, an apparatus, a device, and a storage medium for recognizing video actions, which can improve the stability and smoothness of action recognition results.
  • an embodiment of the present application provides a method for recognizing a video action, including: determining the action category and action positioning information of a current video frame according to the current video frame and at least one forward video frame; and determining the action content of the video according to the action category and action positioning information of video frames.
  • an embodiment of the present application further provides a video action recognition apparatus, including: an action category and action positioning information determination module, configured to determine the action category and action positioning information of a current video frame according to the current video frame and at least one forward video frame; and an action content determination module, configured to determine the action content of the video according to the action category and action positioning information of video frames.
  • an embodiment of the present application further provides a computer device including a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the program, the processor implements the method for recognizing a video action according to the embodiments of the present application.
  • an embodiment of the present application further provides a computer-readable storage medium having stored thereon a computer program that, when executed by a processor, implements the method for recognizing a video action as described in the embodiments of the present application.
  • FIG. 1 is a flowchart of a video action recognition method according to an embodiment of the present application
  • FIG. 2 is a flowchart of a video action recognition method according to an embodiment of the present application.
  • FIG. 3 is a flowchart of a method for identifying a video action according to an embodiment of the present application
  • 4a is a recognition effect diagram of a "like" gesture in an embodiment of the present application.
  • 4b is a recognition effect diagram of a "like" gesture in an embodiment of the present application.
  • 4c is a recognition effect diagram of a "like" gesture in an embodiment of the present application.
  • 4d is a recognition effect diagram of a "like" gesture in an embodiment of the present application.
  • FIG. 5a is a recognition effect diagram of a "two-handed heart" gesture in an embodiment of the present application.
  • FIG. 5b is a recognition effect diagram of a "two-handed heart" gesture in an embodiment of the present application.
  • FIG. 5c is a recognition effect diagram of a "two-handed heart" gesture in an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a video motion recognition device according to an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a computer device according to an embodiment of the present application.
  • FIG. 1 is a flowchart of a method for identifying a video action according to an embodiment of the present application. This embodiment is applicable to a case of identifying a user's action in a live video.
  • the method may be implemented by a device for identifying a video action.
  • the device may be composed of at least one of hardware and software, and may generally be integrated in a device having a video motion recognition function.
  • the device may be an electronic device such as a server, a mobile terminal, or a server cluster. As shown in FIG. 1, the method includes steps 110 to 120.
  • step 110 the motion category and motion positioning information of the current video frame are determined according to the current video frame and at least one forward video frame.
  • the forward video frame may be a video frame before the moment corresponding to the current video frame.
  • the video may be a live video or an on-demand video.
  • the actions may include gestures, postures, etc. of the user, and in one embodiment are gestures.
  • in the case where the action is a gesture, the action category may be the form of the gesture, and the action positioning information may be the movement track of the gesture.
  • the form of the gesture may include a like gesture, an "OK" gesture, a two-handed heart gesture, a one-handed heart gesture, a gun gesture, a "Yeah” gesture, and a hand-hold gesture.
  • the action category and action positioning information of the current video frame may be determined by inputting the current video frame and at least one forward video frame into an action recognition model simultaneously, in which case the model obtains the action category and action positioning information of the current video frame by analyzing the current video frame and the at least one forward video frame; alternatively, the current video frame and the at least one forward video frame may be input into the action recognition model separately to obtain the action category and action positioning information of each video frame, and the action category and action positioning information of the at least one forward video frame are then used to correct those of the current video frame, yielding the target action category and target action positioning information of the current video frame.
  • step 120 the motion content of the video is determined according to the motion category and motion positioning information of the video frame.
  • the action content may be the information to be conveyed by the action.
  • the action content may include: like, "OK", two-handed heart, one-handed heart, gun, "Yeah", hand-hold, and the like.
  • after the action category and action positioning information of the video frames in the video are obtained, the action content in the video can be determined.
  • in this application scenario, after the action category and action positioning information of a video frame are obtained, a set special effect may be triggered at the action positioning point in combination with the action category.
  • the technical solution of this embodiment determines the action category and action positioning information of the current video frame according to the current video frame and at least one forward video frame, and finally determines the action content of the video according to the action category and action positioning information of video frames.
  • the method for recognizing a video action provided in the embodiments of the present application determines the action category and action positioning information of the current video frame according to the current video frame and at least one forward video frame, which can improve the stability of action category recognition and the smoothness of action positioning information recognition.
  • FIG. 2 is a flowchart of a video action recognition method according to an embodiment of the present application.
  • determining the motion category and motion positioning information of the current video frame may be implemented through steps 210 to 230.
  • step 210 a current video frame is acquired, and a motion recognition result of the current video frame is determined.
  • the action recognition result includes action type and action positioning information.
  • the motion positioning information may be motion frame positioning information, including the width of the motion frame, the height of the motion frame, and the center coordinates of the motion frame.
  • the motion category and motion positioning information of the current video frame can be obtained.
  • the action category of the current video frame may be determined by inputting the current video frame into the action recognition model, obtaining the confidence of at least one set action category, and selecting the set action category with the highest confidence as the action category of the current video frame.
  • the motion recognition model may be obtained based on convolutional neural network training, and has a function of identifying motion categories and motion positioning information in a video frame.
  • the set action category can be a category set in the system in advance. Assuming the action is a gesture, the set action categories can include like, "OK", two-handed heart, one-handed heart, gun, "Yeah", hand-hold, and the like. After the current video frame is input into the action recognition model, the confidence of each set action category corresponding to the current video frame is obtained, and the set action category with the highest confidence is used as the action category of the current video frame.
  • for example, assuming the confidences of the set action categories corresponding to the current video frame are: like 0.1, "OK" 0.25, two-handed heart 0.3, one-handed heart 0.3, gun 0.8, "Yeah" 0.4, and hand-hold 0.2, the action category of the current video frame is "gun".
  • the way to determine the motion positioning information of the current video frame may be to input the current video frame into the motion recognition model, and output the width of the motion frame, the height of the motion frame, and the center coordinates of the motion frame.
  • step 220 the motion category of the current video frame is modified according to the motion category of the at least one forward video frame to obtain the target motion category of the current video frame.
  • the action category of the current video frame may be corrected according to the action category of the at least one forward video frame to obtain the target action category of the current video frame in the following manner: for each set action category, sum the confidences of that set action category over the at least one forward video frame and the current video frame; obtain the set action category with the highest confidence sum; among the action categories of the at least one forward video frame and the current video frame, if the number of frames whose category equals the set action category with the highest confidence sum exceeds the set number, determine that set action category as the target action category; otherwise, determine the action category of the current video frame as the target action category.
  • the set number may be determined according to the number of forward video frames.
  • the set number may be any value between 50% and 80% of the number of forward video frames.
  • the target action category can be determined according to the following formula: C = n, if Σ_{f=i-k+1}^{i} δ_f(n) > j; C = c_i, otherwise; where C is the target action category, n is the set action category with the highest confidence sum, j is the set number, c_i is the action category of the current video frame, and δ_f(n) equals 1 when the action category of the f-th video frame is n and 0 otherwise.
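  • As an illustration only (not part of the patent text), the category-correction step described above can be sketched in Python; the array layout, function name, and use of NumPy are assumptions made here:

```python
import numpy as np

def correct_category(confidences, categories, set_number):
    """Correct the current frame's action category using forward frames.

    confidences: (k, N) array of per-frame confidences over N set action
        categories; rows are the k-1 forward frames followed by the
        current frame.
    categories: length-k list of per-frame action categories (argmax ids).
    set_number: the threshold j, e.g. 50%-80% of the forward-frame count.
    """
    # Set action category with the highest confidence sum over the window.
    n = int(np.argmax(confidences.sum(axis=0)))
    # Count frames in the window whose action category equals n.
    votes = sum(1 for c in categories if c == n)
    # Keep n only when enough frames agree; otherwise fall back to the
    # current frame's own category.
    return n if votes > set_number else categories[-1]
```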
  • step 230 the motion positioning information of the current video frame is modified according to the motion positioning information of the previous video frame of the current video frame to obtain the target motion positioning information of the current video frame.
  • the motion positioning information of the previous frame of the current video frame is used for correction.
  • the video action recognition method further includes the following steps: judging whether the absolute value of the difference between the positioning information of the target action frame and the positioning information of the action frame of the previous video frame is less than a set threshold; and, based on the judgment result that the absolute value of the difference is less than the set threshold, updating the positioning information of the target action frame to the positioning information of the action frame of the previous video frame.
  • the set threshold can be set to any value between 1 and 10 pixels; in one embodiment, it is set to 3 or 4 pixels.
  • in this embodiment, if the absolute value of the difference between the width of the target action frame and the width of the action frame of the previous video frame is less than the set threshold, the width of the target action frame is updated to the width of the action frame of the previous video frame; if the absolute value of the difference between the height of the target action frame and the height of the action frame of the previous video frame is less than the set threshold, the height of the target action frame is updated to the height of the action frame of the previous video frame; if the absolute value of the difference between the abscissa of the center coordinate of the target action frame and the abscissa of the center coordinate of the action frame of the previous video frame is less than the set threshold, the abscissa of the center coordinate of the target action frame is updated to the abscissa of the center coordinate of the action frame of the previous video frame; and if the absolute value of the difference between the ordinate of the center coordinate of the target action frame and the ordinate of the center coordinate of the action frame of the previous video frame is less than the set threshold, the ordinate of the center coordinate of the target action frame is updated to the ordinate of the center coordinate of the action frame of the previous video frame.
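  • A minimal sketch of this dead-zone update, assuming a simple dict representation of the action frame (the keys and default threshold below are illustrative choices, not from the patent):

```python
def stabilize_box(target, previous, threshold=3):
    """Snap each component of the target action frame back to the previous
    frame's value when it moved less than `threshold` pixels, which
    suppresses small jitter between frames. Frames are dicts holding the
    width w, height h, and center coordinates (cx, cy)."""
    return {key: previous[key]
            if abs(target[key] - previous[key]) < threshold
            else target[key]
            for key in ("w", "h", "cx", "cy")}
```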
  • in the technical solution of this embodiment, the action category of the current video frame is corrected according to the action category of at least one forward video frame, and the action positioning information of the current video frame is corrected according to the action positioning information of the previous video frame of the current video frame, which can improve the stability and smoothness of video action recognition.
  • FIG. 3 is a flowchart of a method for identifying a video action according to an embodiment of the present application.
  • acquiring a current video frame and determining a motion recognition result of the current video frame may be implemented through steps 310 to 330.
  • step 310 it is determined whether the current video frame is a preset key frame.
  • the preset key frames may be set according to actual needs so that, every set number of video frames, one frame is determined as a key frame; for example, one key frame is determined every 10 video frames.
  • in step 320, based on the judgment result that the current video frame is a preset key frame, the current video frame is input into the first motion recognition sub-model to obtain initial motion positioning information of the current video frame; a first to-be-recognized image region of the current video frame is determined based on the initial motion positioning information, and the first to-be-recognized image region is input into the second motion recognition sub-model to obtain the motion recognition result of the current video frame.
  • the first motion recognition sub-model and the second motion recognition sub-model are obtained by training with different convolutional neural networks.
  • the first motion recognition sub-model can be obtained using DenseNet (Dense Convolutional Network) or ResNet; the second motion recognition sub-model can be obtained using MobileNet-v2.
  • based on the judgment result that the current video frame is a preset key frame, the current video frame is input into the first motion recognition sub-model to obtain the initial motion positioning information; after the initial motion positioning information is obtained, the region circled by the initial action frame is enlarged by a set area or a set number of pixels to obtain the first to-be-recognized image region; finally, the first to-be-recognized image region is input into the second motion recognition sub-model to obtain the motion recognition result of the current video frame.
  • in step 330, based on the judgment result that the current video frame is not a preset key frame, a second to-be-recognized image region of the current video frame is determined according to the action frame positioning information of the previous video frame, and the second to-be-recognized image region is input into the second motion recognition sub-model to obtain the motion recognition result of the current video frame.
  • in an embodiment, the second to-be-recognized image region is obtained by enlarging the region circled by the action frame of the previous video frame by a set area or a set number of pixels; the second to-be-recognized image region is then input into the second motion recognition sub-model to obtain the motion recognition result of the current video frame.
  • in the technical solution of this embodiment, preset key frames are sequentially input into the first motion recognition sub-model and the second motion recognition sub-model to obtain motion recognition results, and non-preset key frames are input into the second motion recognition sub-model to obtain motion recognition results.
  • on the basis of ensuring recognition accuracy, the rate of image recognition can be increased.
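  • To make the two-stage flow concrete, here is an illustrative Python sketch; the model interfaces (detector, classifier) and the margin value are assumptions, not details given in the patent:

```python
def recognize(frame, is_key_frame, prev_box, detector, classifier, margin=16):
    """Two-stage recognition: localize on key frames with the first
    sub-model, then classify a cropped region with the second sub-model.

    frame:      H x W x C image array.
    prev_box:   (cx, cy, w, h) action frame of the previous video frame.
    detector:   first motion recognition sub-model (e.g. DenseNet/ResNet).
    classifier: second motion recognition sub-model (e.g. MobileNet-v2).
    """
    box = detector(frame) if is_key_frame else prev_box
    cx, cy, w, h = box
    # Enlarge the region circled by the action frame by a set number
    # of pixels before classification.
    x0 = max(0, int(cx - w / 2 - margin))
    y0 = max(0, int(cy - h / 2 - margin))
    x1 = min(frame.shape[1], int(cx + w / 2 + margin))
    y1 = min(frame.shape[0], int(cy + h / 2 + margin))
    return classifier(frame[y0:y1, x0:x1]), box
```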
  • FIGS. 4a-4d are recognition effect diagrams of the "like" gesture provided by an embodiment of the present application; as shown in FIGS. 4a-4d, "2" indicates that the gesture category of the video frame is "like".
  • after each video frame is input into the action recognition model, the gesture category of the video frame, "2" (i.e., like), and the gesture positioning information, including the width of the gesture frame, the height of the gesture frame, and the center coordinates of the gesture frame, can be obtained.
  • FIGS. 5a-5c are recognition effect diagrams of the "two-handed heart" gesture provided by an embodiment of the present application; as shown in FIGS. 5a-5c, "5" indicates that the gesture category of the video frame is "two-handed heart".
  • after each video frame is input into the action recognition model, the gesture category of the video frame, "5" (i.e., two-handed heart), and the gesture positioning information, including the width of the gesture frame, the height of the gesture frame, and the center coordinates of the gesture frame, can be obtained.
  • FIG. 6 is a schematic structural diagram of a video motion recognition device according to an embodiment of the present application. As shown in FIG. 6, the device includes: an action category and action positioning information determination module 610 and an action content determination module 620.
  • the action category and action positioning information determining module 610 is configured to determine the action category and action positioning information of the current video frame according to the current video frame and at least one forward video frame;
  • the action content determination module 620 is configured to determine the action content of the video according to the action type and action positioning information of the video frame.
  • in an embodiment, the action category and action positioning information determination module 610 is configured to: acquire the current video frame and determine the action recognition result of the current video frame, the action recognition result including the action category and the action positioning information; correct the action category of the current video frame according to the action category of at least one forward video frame to obtain the target action category of the current video frame; and correct the action positioning information of the current video frame according to the action positioning information of the previous video frame of the current video frame to obtain the target action positioning information of the current video frame.
  • in an embodiment, the action category and action positioning information determination module 610 is configured to: input the current video frame into the action recognition model to obtain the confidence of at least one set action category, and select the set action category with the highest confidence as the action category of the current video frame.
  • in an embodiment, the action category and action positioning information determination module 610 is configured to: for each set action category, sum the confidences of that set action category over the at least one forward video frame and the current video frame; obtain the set action category with the highest confidence sum; among the action categories of the at least one forward video frame and the current video frame, if the number of frames whose category equals the set action category with the highest confidence sum exceeds the set number, determine that set action category as the target action category; otherwise, determine the action category of the current video frame as the target action category.
  • the motion positioning information includes a width of the motion frame, a height of the motion frame, and a center coordinate of the motion frame.
  • in an embodiment, the action category and action positioning information determination module 610 is configured to: for the width or height of the action frame, obtain a gain factor and calculate the width or height of the target action frame as x = x2 + k(x1 - x2), where x is the width or height of the target action frame, k is the gain factor, x1 is the width or height of the action frame of the current video frame, and x2 is the width or height of the action frame of the previous video frame; and, for the center coordinates of the action frame, obtain a gain matrix and calculate the center coordinates of the target action frame as Y = Y2 + K*(Y1 - H*Y2), where Y is the center coordinate of the target action frame, Y2 is the center coordinate of the action frame of the previous video frame, K is the gain matrix, H is the identity matrix, and Y1 is the center coordinate of the action frame of the current video frame.
  • in an embodiment, the action category and action positioning information determination module 610 is configured to: judge whether the absolute value of the difference between the positioning information of the target action frame and the positioning information of the action frame of the previous video frame is less than a set threshold; and, based on the judgment result that the absolute value of the difference is less than the set threshold, update the positioning information of the target action frame to the positioning information of the action frame of the previous video frame.
  • the gain factor is calculated according to the following formula: k = p⁻/(p⁻ + r), where p⁻ = p + q, p is the posterior error, p⁻ is the prior error, q is the process deviation, and r is the measurement deviation; the gain matrix is calculated according to the following formula: K = P⁻*Hᵀ*S⁻¹, where P⁻ = A*err*Aᵀ + Q, S = H*P⁻*Hᵀ + R, A is the motion matrix, Q is the process variance matrix, R is the measurement variance matrix, err is the center-point error matrix, and H is the identity matrix.
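  • These updates have the shape of a one-dimensional Kalman filter; below is a minimal Python sketch under that reading (the class name and parameter defaults are assumptions):

```python
class ScalarSmoother:
    """Smooths an action-frame width or height across frames.

    Implements k = p-/(p- + r) with p- = p + q, the posterior update
    p = (1 - k) * p-, and the correction x = x2 + k * (x1 - x2) from
    the formulas above (q: process deviation, r: measurement deviation).
    """
    def __init__(self, q=1e-3, r=1e-2, p=1.0):
        self.q, self.r, self.p = q, r, p

    def update(self, x1, x2):
        """x1: current frame's value; x2: previous frame's value."""
        p_minus = self.p + self.q           # prior error p-
        k = p_minus / (p_minus + self.r)    # gain factor
        self.p = (1.0 - k) * p_minus        # posterior error for next frame
        return x2 + k * (x1 - x2)           # target width or height
```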
  • in an embodiment, the action category and action positioning information determination module 610 is configured to: judge whether the current video frame is a preset key frame; based on the judgment result that the current video frame is a preset key frame, input the current video frame into the first motion recognition sub-model to obtain the initial motion positioning information of the current video frame, determine the first to-be-recognized image region of the current video frame according to the initial motion positioning information, and input the first to-be-recognized image region into the second motion recognition sub-model to obtain the motion recognition result of the current video frame, the first motion recognition sub-model and the second motion recognition sub-model being obtained by training with different convolutional neural networks; and, based on the judgment result that the current video frame is not a preset key frame, determine the second to-be-recognized image region of the current video frame according to the action frame positioning information of the previous video frame, and input the second to-be-recognized image region into the second motion recognition sub-model to obtain the motion recognition result of the current video frame.
  • in an embodiment, the action is a user's gesture, the action category is the form of the gesture, and the action positioning information is the movement track of the gesture.
  • the above apparatus can execute the methods provided in all the foregoing embodiments of the present application; for technical details not described in detail in this embodiment, reference may be made to those methods.
  • FIG. 7 is a schematic structural diagram of a computer device according to an embodiment of the present application.
  • FIG. 7 shows a block diagram of a computer device 712 suitable for implementing the embodiments of the present application.
  • the computer device 712 shown in FIG. 7 is merely an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
  • the device 712 is typically a computing device that undertakes the video action recognition function.
  • the computer device 712 is represented in the form of a general-purpose computing device.
  • the components of the computer device 712 may include, but are not limited to, at least one processor 716, a storage device 728, and a bus 718 connecting different system components (including the storage device 728 and the processor 716).
  • the bus 718 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures.
  • for example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
  • Computer device 712 typically includes a variety of computer system-readable media. These media can be any available media that can be accessed by the computer device 712, including volatile and non-volatile media, removable and non-removable media.
  • the storage device 728 may include a computer system-readable medium in the form of a volatile memory, such as at least one of a Random Access Memory (RAM) 730 and a cache memory 732.
  • Computer device 712 may further include other removable / non-removable, volatile / nonvolatile computer system storage media.
  • the storage system 734 may be used to read and write non-removable, non-volatile magnetic media (not shown in FIG. 7, commonly referred to as a "hard drive"). Although not shown in FIG. 7, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from and writing to a removable non-volatile optical disk (e.g., a Compact Disc Read-Only Memory (CD-ROM), a Digital Video Disc Read-Only Memory (DVD-ROM), or other optical media) may be provided.
  • in these cases, each drive may be connected to the bus 718 through at least one data medium interface.
  • the storage device 728 may include at least one program product having a set (for example, at least one) of program modules configured to perform the functions of the embodiments of the present application.
  • a program 736 having a set of (at least one) program modules 726 may be stored in, for example, a storage device 728.
  • such program modules 726 include, but are not limited to, an operating system, at least one application program, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment.
  • Program module 726 typically performs functions and / or methods in the embodiments described herein.
  • the computer device 712 may also communicate with at least one external device 714 (e.g., a keyboard, a pointing device, a camera, or a display 724), with at least one device that enables a user to interact with the computer device 712, and/or with any device (e.g., a network card or a modem) that enables the computer device 712 to communicate with at least one other computing device. Such communication can be performed through an input/output (I/O) interface 722.
  • the computer device 712 may also communicate with at least one network (for example, a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through the network adapter 720.
  • the network adapter 720 communicates with other modules of the computer device 712 via the bus 718. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the computer device 712, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, Redundant Arrays of Independent Disks (RAID) systems, tape drives, and data backup storage systems.
  • the processor 716 executes various functional applications and data processing by running a program stored in the storage device 728, for example, implementing a method for identifying a video action provided in the foregoing embodiment of the present application.
  • the sixth embodiment of the present application further provides a computer-readable storage medium having a computer program stored thereon; when the program is executed by a processor, the method for recognizing a video action as provided in the embodiments of the present application is implemented.
  • the computer-readable storage medium provided in the embodiments of the present application is not limited to the method operations described above; the computer program stored thereon may also perform related operations in the video action recognition method provided in any embodiment of the present application.
  • the computer storage medium in the embodiment of the present application may adopt any combination of at least one computer-readable medium.
  • the computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium.
  • the computer-readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device.
  • the computer-readable signal medium may include a data signal in baseband or propagated as part of a carrier wave, which carries a computer-readable program code. Such a propagated data signal may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • the computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium; such a computer-readable medium may send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • Computer program code for performing the operations of this application may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code can be executed entirely on the user's computer, partly on the user's computer, as an independent software package, partly on the user's computer, partly on a remote computer, or entirely on a remote computer or server.
  • the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Psychiatry (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Social Psychology (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

An embodiment of the present application discloses a method, apparatus, device, and storage medium for recognizing video actions. The method includes: determining the action category and action positioning information of a current video frame according to the current video frame and at least one forward video frame; and determining the action content of the video according to the action category and action positioning information of video frames.

Description

Method, apparatus, device, and storage medium for recognizing video actions
This application claims priority to the Chinese patent application No. 201811107097.0 filed with the Chinese Patent Office on September 21, 2018, the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present application relate to the technical field of action recognition, and, for example, to a method, apparatus, device, and storage medium for recognizing video actions.
Background
With the rapid development of related technologies such as computer vision and machine learning, human-computer interaction technology has received increasing attention. In human-computer interaction, the user's body information needs to be recognized, for example through face recognition, gesture recognition, and body posture recognition. Among these, gesture recognition, as an intuitive way of communication, has important research value and significance.
Video gesture recognition is usually applied in application scenarios that require strong interaction. Continuously locating and recognizing a user's gestures involves uncontrollable factors such as complex backgrounds, motion blur, and non-standard actions.
The gesture recognition performed on images in a video in the related art cannot guarantee the stability and smoothness of the gesture recognition result.
Summary
Embodiments of the present application provide a method, apparatus, device, and storage medium for recognizing video actions, which can improve the stability and smoothness of action recognition results.
In a first aspect, an embodiment of the present application provides a method for recognizing a video action, including: determining the action category and action positioning information of a current video frame according to the current video frame and at least one forward video frame; and determining the action content of the video according to the action category and action positioning information of video frames.
In a second aspect, an embodiment of the present application further provides a video action recognition apparatus, including: an action category and action positioning information determination module, configured to determine the action category and action positioning information of a current video frame according to the current video frame and at least one forward video frame; and an action content determination module, configured to determine the action content of the video according to the action category and action positioning information of video frames.
In a third aspect, an embodiment of the present application further provides a computer device including a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the program, the method for recognizing a video action according to the embodiments of the present application is implemented.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium having a computer program stored thereon; when the program is executed by a processor, the method for recognizing a video action according to the embodiments of the present application is implemented.
Brief Description of the Drawings
FIG. 1 is a flowchart of a method for recognizing a video action in an embodiment of the present application;
FIG. 2 is a flowchart of a method for recognizing a video action in an embodiment of the present application;
FIG. 3 is a flowchart of a method for recognizing a video action in an embodiment of the present application;
FIG. 4a is a recognition effect diagram of a "like" gesture in an embodiment of the present application;
FIG. 4b is a recognition effect diagram of a "like" gesture in an embodiment of the present application;
FIG. 4c is a recognition effect diagram of a "like" gesture in an embodiment of the present application;
FIG. 4d is a recognition effect diagram of a "like" gesture in an embodiment of the present application;
FIG. 5a is a recognition effect diagram of a "two-handed heart" gesture in an embodiment of the present application;
FIG. 5b is a recognition effect diagram of a "two-handed heart" gesture in an embodiment of the present application;
FIG. 5c is a recognition effect diagram of a "two-handed heart" gesture in an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a video action recognition apparatus in an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a computer device in an embodiment of the present application.
Detailed Description
FIG. 1 is a flowchart of a method for recognizing a video action provided by an embodiment of the present application. This embodiment is applicable to recognizing a user's actions in a live video. The method may be executed by a video action recognition apparatus; the apparatus may be composed of at least one of hardware and software and may generally be integrated in a device having a video action recognition function, and the device may be an electronic device such as a server, a mobile terminal, or a server cluster. As shown in FIG. 1, the method includes steps 110 to 120.
In step 110, the action category and action positioning information of the current video frame are determined according to the current video frame and at least one forward video frame.
A forward video frame may be a video frame before the moment corresponding to the current video frame. The video may be a live video or an on-demand video. The actions may include the user's gestures, body postures, and the like, and in one embodiment are gestures. In the case where the action is a gesture, the action category may be the form of the gesture, and the action positioning information may be the movement track of the gesture. For example, the forms of the gesture may include: a like gesture, an "OK" gesture, a two-handed heart gesture, a one-handed heart gesture, a gun gesture, a "Yeah" gesture, a hand-hold gesture, and the like.
In this embodiment, the action category and action positioning information of the current video frame may be determined by inputting the current video frame and at least one forward video frame into an action recognition model simultaneously, in which case the model obtains the action category and action positioning information of the current video frame by analyzing the current video frame and the at least one forward video frame; alternatively, the current video frame and the at least one forward video frame may be input into the action recognition model separately to obtain the action category and action positioning information of each video frame, and the action category and action positioning information of the at least one forward video frame are then used to correct those of the current video frame, yielding the target action category and target action positioning information of the current video frame.
In step 120, the action content of the video is determined according to the action category and action positioning information of the video frames.
The action content may be the information to be conveyed by the action; for example, taking gestures as an example, the action content may include: like, "OK", two-handed heart, one-handed heart, gun, "Yeah", hand-hold, and the like.
After the action category and action positioning information of the video frames in the video are obtained, the action content in the video can be determined. In an embodiment, in this application scenario, after the action category and action positioning information of a video frame are obtained, a set special effect may be triggered at the action positioning point in combination with the action category.
In the technical solution of this embodiment, the action category and action positioning information of the current video frame are determined according to the current video frame and at least one forward video frame, and finally the action content of the video is determined according to the action category and action positioning information of the video frames. The method for recognizing a video action provided by the embodiments of the present application determines the action category and action positioning information of the current video frame according to the current video frame and at least one forward video frame, which can improve the stability of action category recognition and the smoothness of action positioning information recognition.
FIG. 2 is a flowchart of a method for recognizing a video action provided by an embodiment of the present application. As an explanation of the above embodiment, as shown in FIG. 2, determining the action category and action positioning information of the current video frame according to the current video frame and at least one forward video frame may be implemented through steps 210 to 230.
In step 210, the current video frame is acquired, and the action recognition result of the current video frame is determined.
The action recognition result includes the action category and the action positioning information. The action positioning information may be action frame positioning information, including the width of the action frame, the height of the action frame, and the center coordinates of the action frame.
In this embodiment, the action category and action positioning information of the current video frame can be obtained by inputting the current video frame into the action recognition model. In an embodiment, the action category of the current video frame may be determined by inputting the current video frame into the action recognition model, obtaining the confidence of at least one set action category, and selecting the set action category with the highest confidence as the action category of the current video frame.
The action recognition model may be obtained based on convolutional neural network training and has the function of recognizing the action category and action positioning information in a video frame. The set action categories may be categories preset in the system; assuming the action is a gesture, the set action categories may include like, "OK", two-handed heart, one-handed heart, gun, "Yeah", hand-hold, and the like. After the current video frame is input into the action recognition model, the confidence of each set action category corresponding to the current video frame is obtained, and the set action category with the highest confidence is used as the action category of the current video frame. For example, assuming the confidences of the set action categories corresponding to the current video frame are: like 0.1, "OK" 0.25, two-handed heart 0.3, one-handed heart 0.3, gun 0.8, "Yeah" 0.4, and hand-hold 0.2, the action category of the current video frame is "gun".
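For illustration only, the per-frame selection can be written in a few lines of Python; the category labels below come from the example in the text, and the function name and use of NumPy are choices made here:

```python
import numpy as np

SET_CATEGORIES = ["like", "OK", "two-handed heart", "one-handed heart",
                  "gun", "Yeah", "hand-hold"]

def classify_frame(confidences):
    """Return the set action category with the highest confidence for one
    frame, given the model's confidence vector over SET_CATEGORIES."""
    return SET_CATEGORIES[int(np.argmax(confidences))]

# With the example confidences above, "gun" (0.8) is selected:
print(classify_frame([0.1, 0.25, 0.3, 0.3, 0.8, 0.4, 0.2]))
```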
In an embodiment, the action positioning information of the current video frame may be determined by inputting the current video frame into the action recognition model and outputting the width of the action frame, the height of the action frame, and the center coordinates of the action frame.
In step 220, the action category of the current video frame is corrected according to the action category of at least one forward video frame to obtain the target action category of the current video frame.
In an embodiment, correcting the action category of the current video frame according to the action category of at least one forward video frame to obtain the target action category of the current video frame may be implemented as follows: for each set action category, sum the confidences of that set action category over the at least one forward video frame and the current video frame; obtain the set action category with the highest confidence sum; among the action categories of the at least one forward video frame and the current video frame, if the number of frames whose category equals the set action category with the highest confidence sum exceeds the set number, determine that set action category as the target action category; otherwise, determine the action category of the current video frame as the target action category.
The set number may be determined according to the number of forward video frames; for example, the set number may be any value between 50% and 80% of the number of forward video frames. In an embodiment, the set action category with the highest confidence sum can be calculated according to the following formula:

n = argmax_c Σ_{f=i-k+1}^{i} prob_f(c)·δ_f(c), for c = 1, …, N (i > k+1),

where n is the set action category with the highest confidence sum, prob_f is the set-action-category confidence vector of the f-th video frame, N is the number of set action categories, c is the number corresponding to a set action category, the number of forward video frames is k-1, and δ_f(c) takes the value 1 when the action category of prob_f is c and the value 0 otherwise. After the set action category with the highest confidence sum is obtained as n, the target action category can be determined according to the following formula:

C = n, if Σ_{f=i-k+1}^{i} δ_f(n) > j; C = c_i, otherwise;

where C is the target action category, j is the set number, and c_i is the action category of the current video frame.
In step 230, the action positioning information of the current video frame is corrected according to the action positioning information of the previous video frame of the current video frame to obtain the target action positioning information of the current video frame.
In this scenario, when the action positioning information of the current video frame is corrected, the correction is performed according to the action positioning information of the previous frame of the current video frame. In an embodiment, correcting the action positioning information of the current video frame according to the action positioning information of the previous video frame to obtain the target action positioning information of the current video frame may be implemented as follows: for the width or height of the action frame, obtain a gain factor, and calculate the width or height of the target action frame according to the gain factor by the following formula: x = x2 + k(x1 - x2), where x is the width or height of the target action frame, k is the gain factor, x1 is the width or height of the action frame of the current video frame, and x2 is the width or height of the action frame of the previous video frame. For the center coordinates of the action frame, obtain a gain matrix, and calculate the center coordinates of the target action frame according to the gain matrix by the following formula: Y = Y2 + K*(Y1 - H*Y2), where Y is the center coordinate of the target action frame, Y2 is the center coordinate of the action frame of the previous video frame, K is the gain matrix, H is the identity matrix, and Y1 is the center coordinate of the action frame of the current video frame.
The gain factor can be calculated according to the following formula: k = p⁻/(p⁻ + r), where p⁻ = p + q, p is the posterior error, p⁻ is the prior error, q is the process deviation, and r is the measurement deviation. The process deviation and the measurement deviation can be values obtained after multiple experiments, and the posterior error can be obtained iteratively according to the following formula: p = (1 - k)*p⁻.
The gain matrix can be calculated according to the following formula: K = P⁻*Hᵀ*S⁻¹, where P⁻ = A*err*Aᵀ + Q, S = H*P⁻*Hᵀ + R, A is the motion matrix, Q is the process variance matrix, R is the measurement variance matrix, err is the center-point error matrix, and H is the identity matrix. The center-point error matrix can be obtained iteratively according to the following formula: err = (1 - K*H)*P⁻.
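A minimal Python sketch of the center-coordinate correction above, assuming two-dimensional center vectors and taking H as the 2x2 identity matrix (the function and variable names are chosen here for illustration):

```python
import numpy as np

def correct_center(y1, y2, err, A, Q, R):
    """One correction step for the action-frame center coordinates.

    y1:  center coordinate of the action frame of the current video frame.
    y2:  center coordinate of the action frame of the previous video frame.
    err: center-point error matrix carried over from the previous step.
    A:   motion matrix; Q: process variance matrix; R: measurement
         variance matrix. Returns the corrected center and updated err.
    """
    H = np.eye(2)                          # identity matrix
    P_minus = A @ err @ A.T + Q            # P- = A*err*A^T + Q
    S = H @ P_minus @ H.T + R              # S = H*P-*H^T + R
    K = P_minus @ H.T @ np.linalg.inv(S)   # gain matrix
    y = y2 + K @ (y1 - H @ y2)             # Y = Y2 + K*(Y1 - H*Y2)
    err = (np.eye(2) - K @ H) @ P_minus    # err = (1 - K*H)*P-
    return y, err
```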
In an embodiment, the video action recognition method further includes the following steps: judging whether the absolute value of the difference between the positioning information of the target action frame and the positioning information of the action frame of the previous video frame is less than a set threshold; and, based on the judgment result that the absolute value of the difference is less than the set threshold, updating the positioning information of the target action frame to the positioning information of the action frame of the previous video frame.
The set threshold can be set to any value between 1 and 10 pixels; in one embodiment, it is set to 3 or 4 pixels.
In this embodiment, if the absolute value of the difference between the width of the target action frame and the width of the action frame of the previous video frame is less than the set threshold, the width of the target action frame is updated to the width of the action frame of the previous video frame; if the absolute value of the difference between the height of the target action frame and the height of the action frame of the previous video frame is less than the set threshold, the height of the target action frame is updated to the height of the action frame of the previous video frame; if the absolute value of the difference between the abscissa of the center coordinate of the target action frame and the abscissa of the center coordinate of the action frame of the previous video frame is less than the set threshold, the abscissa of the center coordinate of the target action frame is updated to the abscissa of the center coordinate of the action frame of the previous video frame; and if the absolute value of the difference between the ordinate of the center coordinate of the target action frame and the ordinate of the center coordinate of the action frame of the previous video frame is less than the set threshold, the ordinate of the center coordinate of the target action frame is updated to the ordinate of the center coordinate of the action frame of the previous video frame.
In the technical solution of this embodiment, the action category of the current video frame is corrected according to the action category of at least one forward video frame, and the action positioning information of the current video frame is corrected according to the action positioning information of the previous video frame of the current video frame, which can improve the stability and smoothness of video action recognition.
FIG. 3 is a flowchart of a method for recognizing a video action provided by an embodiment of the present application. As an explanation of the above embodiment, as shown in FIG. 3, acquiring the current video frame and determining the action recognition result of the current video frame may be implemented through steps 310 to 330.
In step 310, it is judged whether the current video frame is a preset key frame.
The preset key frames may be set according to actual needs so that, every set number of video frames, one frame is determined as a key frame; for example, one key frame is determined every 10 video frames.
In step 320, based on the judgment result that the current video frame is a preset key frame, the current video frame is input into the first action recognition sub-model to obtain the initial action positioning information of the current video frame; the first to-be-recognized image region of the current video frame is determined according to the initial action frame positioning information, and the first to-be-recognized image region is input into the second action recognition sub-model to obtain the action recognition result of the current video frame.
The first action recognition sub-model and the second action recognition sub-model are obtained by training with different convolutional neural networks. The first action recognition sub-model can be obtained using DenseNet (Dense Convolutional Network) or ResNet; the second action recognition sub-model can be obtained using MobileNet-v2.
In an embodiment, based on the judgment result that the current video frame is a preset key frame, the current video frame is input into the first action recognition sub-model to obtain the initial action positioning information; after the initial action positioning information is obtained, the region circled by the initial action frame is enlarged by a set area or a set number of pixels to obtain the first to-be-recognized image region; finally, the first to-be-recognized image region is input into the second action recognition sub-model to obtain the action recognition result of the current video frame.
In step 330, based on the judgment result that the current video frame is not a preset key frame, the second to-be-recognized image region of the current video frame is determined according to the action frame positioning information of the previous video frame, and the second to-be-recognized image region is input into the second action recognition sub-model to obtain the action recognition result of the current video frame.
In an embodiment, based on the judgment result that the current video frame is not a preset key frame, the region circled by the action frame of the previous video frame is enlarged by a set area or a set number of pixels to obtain the second to-be-recognized image region; finally, the second to-be-recognized image region is input into the second action recognition sub-model to obtain the action recognition result of the current video frame.
In the technical solution of this embodiment, preset key frames are input sequentially into the first action recognition sub-model and the second action recognition sub-model to obtain action recognition results, and non-preset key frames are input into the second action recognition sub-model to obtain action recognition results; on the basis of ensuring recognition accuracy, the rate of image recognition can be increased.
For example, FIGS. 4a-4d are recognition effect diagrams of the "like" gesture provided by an embodiment of the present application; as shown in FIGS. 4a-4d, "2" indicates that the gesture category of the video frame is "like". After each video frame is input into the action recognition model, the gesture category of the video frame, "2" (i.e., like), and the gesture positioning information, including the width of the gesture frame, the height of the gesture frame, and the center coordinates of the gesture frame, can be obtained.
FIGS. 5a-5c are recognition effect diagrams of the "two-handed heart" gesture provided by an embodiment of the present application; as shown in FIGS. 5a-5c, "5" indicates that the gesture category of the video frame is "two-handed heart". After each video frame is input into the action recognition model, the gesture category of the video frame, "5" (i.e., two-handed heart), and the gesture positioning information, including the width of the gesture frame, the height of the gesture frame, and the center coordinates of the gesture frame, can be obtained.
FIG. 6 is a schematic structural diagram of a video action recognition apparatus provided by an embodiment of the present application. As shown in FIG. 6, the apparatus includes: an action category and action positioning information determination module 610 and an action content determination module 620.
The action category and action positioning information determination module 610 is configured to determine the action category and action positioning information of the current video frame according to the current video frame and at least one forward video frame.
The action content determination module 620 is configured to determine the action content of the video according to the action category and action positioning information of video frames.
In an embodiment, the action category and action positioning information determination module 610 is configured to:
acquire the current video frame and determine the action recognition result of the current video frame, the action recognition result including the action category and the action positioning information;
correct the action category of the current video frame according to the action category of at least one forward video frame to obtain the target action category of the current video frame; and
correct the action positioning information of the current video frame according to the action positioning information of the previous video frame of the current video frame to obtain the target action positioning information of the current video frame.
In an embodiment, the action category and action positioning information determination module 610 is configured to:
input the current video frame into the action recognition model to obtain the confidence of at least one set action category; and
select the set action category with the highest confidence as the action category of the current video frame.
In an embodiment, the action category and action positioning information determination module 610 is configured to:
for each set action category, sum the confidences of that set action category over the at least one forward video frame and the current video frame;
obtain the set action category with the highest confidence sum;
among the action categories of the at least one forward video frame and the current video frame, if the number of frames whose category equals the set action category with the highest confidence sum exceeds the set number, determine that set action category as the target action category; and
among the action categories of the at least one forward video frame and the current video frame, if the number of frames whose category equals the set action category with the highest confidence sum does not exceed the set number, determine the action category of the current video frame as the target action category.
In an embodiment, the action positioning information includes the width of the action frame, the height of the action frame, and the center coordinates of the action frame.
In an embodiment, the action category and action positioning information determination module 610 is configured to:
for the width or height of the action frame, obtain a gain factor;
calculate the width or height of the target action frame according to the gain factor by the following formula:
x = x2 + k(x1 - x2);
where x is the width or height of the target action frame, k is the gain factor, x1 is the width or height of the action frame of the current video frame, and x2 is the width or height of the action frame of the previous video frame;
for the center coordinates of the action frame, obtain a gain matrix; and
calculate the center coordinates of the target action frame according to the gain matrix by the following formula:
Y = Y2 + K*(Y1 - H*Y2);
where Y is the center coordinate of the target action frame, Y2 is the center coordinate of the action frame of the previous video frame, K is the gain matrix, H is the identity matrix, and Y1 is the center coordinate of the action frame of the current video frame.
In an embodiment, the action category and action positioning information determination module 610 is configured to:
judge whether the absolute value of the difference between the positioning information of the target action frame and the positioning information of the action frame of the previous video frame is less than a set threshold; and
based on the judgment result that the absolute value of the difference is less than the set threshold, update the positioning information of the target action frame to the positioning information of the action frame of the previous video frame.
In an embodiment, the gain factor is calculated according to the following formula:
k = p⁻/(p⁻ + r);
where p⁻ = p + q, p is the posterior error, p⁻ is the prior error, q is the process deviation, and r is the measurement deviation;
and the gain matrix is calculated according to the following formula:
K = P⁻*Hᵀ*S⁻¹;
where P⁻ = A*err*Aᵀ + Q, S = H*P⁻*Hᵀ + R, A is the motion matrix, Q is the process variance matrix, R is the measurement variance matrix, err is the center-point error matrix, and H is the identity matrix.
In an embodiment, the action category and action positioning information determination module 610 is configured to:
judge whether the current video frame is a preset key frame;
based on the judgment result that the current video frame is a preset key frame, input the current video frame into the first action recognition sub-model to obtain the initial action positioning information of the current video frame, determine the first to-be-recognized image region of the current video frame according to the initial action positioning information, and input the first to-be-recognized image region into the second action recognition sub-model to obtain the action recognition result of the current video frame, the first action recognition sub-model and the second action recognition sub-model being obtained by training with different convolutional neural networks; and
based on the judgment result that the current video frame is not a preset key frame, determine the second to-be-recognized image region of the current video frame according to the action frame positioning information of the previous video frame, and input the second to-be-recognized image region into the second action recognition sub-model to obtain the action recognition result of the current video frame.
In an embodiment, the action is a user's gesture, the action category is the form of the gesture, and the action positioning information is the movement track of the gesture.
The above apparatus can execute the methods provided in all the foregoing embodiments of the present application; for technical details not described in detail in this embodiment, reference may be made to the methods provided in all the foregoing embodiments of the present application.
FIG. 7 is a schematic structural diagram of a computer device provided by an embodiment of the present application. FIG. 7 shows a block diagram of a computer device 712 suitable for implementing the embodiments of the present application. The computer device 712 shown in FIG. 7 is merely an example and should not impose any limitation on the functions and scope of use of the embodiments of the present application. The device 712 is typically a computing device that undertakes the video action recognition function.
As shown in FIG. 7, the computer device 712 is represented in the form of a general-purpose computing device. The components of the computer device 712 may include, but are not limited to: at least one processor 716, a storage device 728, and a bus 718 connecting different system components (including the storage device 728 and the processor 716).
The bus 718 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. For example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The computer device 712 typically includes a variety of computer-system-readable media. These media can be any available media that can be accessed by the computer device 712, including volatile and non-volatile media and removable and non-removable media.
The storage device 728 may include a computer-system-readable medium in the form of volatile memory, such as at least one of a Random Access Memory (RAM) 730 and a cache memory 732. The computer device 712 may further include other removable/non-removable, volatile/non-volatile computer system storage media. Merely as an example, the storage system 734 may be used to read and write non-removable, non-volatile magnetic media (not shown in FIG. 7, commonly referred to as a "hard drive"). Although not shown in FIG. 7, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from and writing to a removable non-volatile optical disk (e.g., a Compact Disc Read-Only Memory (CD-ROM), a Digital Video Disc Read-Only Memory (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to the bus 718 through at least one data medium interface. The storage device 728 may include at least one program product having a set (e.g., at least one) of program modules configured to perform the functions of the embodiments of the present application.
A program 736 having a set (at least one) of program modules 726 may be stored, for example, in the storage device 728. Such program modules 726 include, but are not limited to, an operating system, at least one application program, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment. The program modules 726 typically perform the functions and/or methods in the embodiments described in this application.
The computer device 712 may also communicate with at least one external device 714 (e.g., a keyboard, a pointing device, a camera, or a display 724), with at least one device that enables a user to interact with the computer device 712, and/or with any device (e.g., a network card or a modem) that enables the computer device 712 to communicate with at least one other computing device. Such communication can be performed through an input/output (I/O) interface 722. In addition, the computer device 712 may also communicate with at least one network (for example, a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through the network adapter 720. As shown in the figure, the network adapter 720 communicates with other modules of the computer device 712 via the bus 718. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the computer device 712, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, Redundant Arrays of Independent Disks (RAID) systems, tape drives, and data backup storage systems.
The processor 716 executes various functional applications and data processing by running programs stored in the storage device 728, for example implementing the method for recognizing a video action provided in the above embodiments of the present application.
The sixth embodiment of the present application further provides a computer-readable storage medium having a computer program stored thereon; when the program is executed by a processor, the method for recognizing a video action as provided in the embodiments of the present application is implemented.
Of course, for the computer-readable storage medium provided in the embodiments of the present application, the computer program stored thereon is not limited to the method operations described above and may also perform related operations in the video action recognition method provided in any embodiment of the present application.
The computer storage medium in the embodiments of the present application may adopt any combination of at least one computer-readable medium. The computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. The computer-readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: an electrical connection having at least one wire, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. The computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium; such a computer-readable medium may send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
The program code contained on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, or any suitable combination of the foregoing.
Computer program code for performing the operations of this application may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partly on the user's computer, as an independent software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).

Claims (13)

  1. A method for recognizing a video action, comprising:
    determining the action category and action positioning information of a current video frame according to the current video frame and at least one forward video frame; and
    determining the action content of the video according to the action category and action positioning information of video frames.
  2. The method according to claim 1, wherein determining the action category and action positioning information of the current video frame according to the current video frame and at least one forward video frame comprises:
    acquiring the current video frame and determining the action recognition result of the current video frame, wherein the action recognition result comprises the action category and the action positioning information;
    correcting the action category of the current video frame according to the action category of at least one forward video frame to obtain the target action category of the current video frame; and
    correcting the action positioning information of the current video frame according to the action positioning information of the previous video frame of the current video frame to obtain the target action positioning information of the current video frame.
  3. The method according to claim 2, wherein acquiring the current video frame and determining the action category of the current video frame comprises:
    inputting the current video frame into an action recognition model to obtain the confidence of at least one set action category; and
    selecting the set action category with the highest confidence as the action category of the current video frame.
  4. The method according to claim 3, wherein correcting the action category of the current video frame according to the action category of at least one forward video frame to obtain the target action category of the current video frame comprises:
    for each set action category, summing the confidences of the set action category over the at least one forward video frame and the current video frame;
    obtaining the set action category with the highest confidence sum;
    among the action categories of the at least one forward video frame and the current video frame, in a case where the number equal to the set action category with the highest confidence sum exceeds a set number, determining the set action category with the highest confidence sum as the target action category; and
    among the action categories of the at least one forward video frame and the current video frame, in a case where the number equal to the set action category with the highest confidence sum does not exceed the set number, determining the action category of the current video frame as the target action category.
  5. The method according to claim 2, wherein the action positioning information comprises the width of the action frame, the height of the action frame, and the center coordinates of the action frame.
  6. The method according to claim 5, wherein correcting the action positioning information of the current video frame according to the action positioning information of the previous video frame of the current video frame to obtain the target action positioning information of the current video frame comprises:
    for the width or height of the action frame, obtaining a gain factor;
    calculating the width or height of the target action frame according to the gain factor by the following formula:
    x = x2 + k(x1 - x2);
    wherein x is the width or height of the target action frame, k is the gain factor, x1 is the width or height of the action frame of the current video frame, and x2 is the width or height of the action frame of the previous video frame;
    for the center coordinates of the action frame, obtaining a gain matrix; and
    calculating the center coordinates of the target action frame according to the gain matrix by the following formula:
    Y = Y2 + K*(Y1 - H*Y2);
    wherein Y is the center coordinate of the target action frame, Y2 is the center coordinate of the action frame of the previous video frame, K is the gain matrix, H is the identity matrix, and Y1 is the center coordinate of the action frame of the current video frame.
  7. The method according to claim 6, further comprising:
    judging whether the absolute value of the difference between the positioning information of the target action frame and the positioning information of the action frame of the previous video frame is less than a set threshold; and
    based on the judgment result that the absolute value of the difference between the positioning information of the target action frame and the positioning information of the action frame of the previous video frame is less than the set threshold, updating the positioning information of the target action frame to the positioning information of the action frame of the previous video frame.
  8. The method according to claim 6, wherein the gain factor is calculated according to the following formula:
    k = p⁻/(p⁻ + r);
    wherein p⁻ = p + q, p is the posterior error, p⁻ is the prior error, q is the process deviation, and r is the measurement deviation;
    and the gain matrix is calculated according to the following formula:
    K = P⁻*Hᵀ*S⁻¹;
    wherein P⁻ = A*err*Aᵀ + Q, S = H*P⁻*Hᵀ + R, A is the motion matrix, Q is the process variance matrix, R is the measurement variance matrix, err is the center-point error matrix, and H is the identity matrix.
  9. The method according to claim 2, wherein acquiring the current video frame and determining the action recognition result of the current video frame comprises:
    judging whether the current video frame is a preset key frame;
    based on the judgment result that the current video frame is a preset key frame, inputting the current video frame into a first action recognition sub-model to obtain the initial action positioning information of the current video frame, determining the first to-be-recognized image region of the current video frame according to the initial action positioning information, and inputting the first to-be-recognized image region into a second action recognition sub-model to obtain the action recognition result of the current video frame, wherein the first action recognition sub-model and the second action recognition sub-model are obtained by training with different convolutional neural networks; and
    based on the judgment result that the current video frame is not a preset key frame, determining the second to-be-recognized image region of the current video frame according to the action frame positioning information of the previous video frame, and inputting the second to-be-recognized image region into the second action recognition sub-model to obtain the action recognition result of the current video frame.
  10. The method according to claim 1, wherein the action is a user's gesture, the action category is the form of the gesture, and the action positioning information is the movement track of the gesture.
  11. A video action recognition apparatus, comprising:
    an action category and action positioning information determination module, configured to determine the action category and action positioning information of a current video frame according to the current video frame and at least one forward video frame; and
    an action content determination module, configured to determine the action content of the video according to the action category and the action positioning information of video frames.
  12. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein, when the processor executes the program, the method according to any one of claims 1-10 is implemented.
  13. A computer-readable storage medium having a computer program stored thereon, wherein, when the program is executed by a processor, the method according to any one of claims 1-10 is implemented.
PCT/CN2019/102717 2018-09-21 2019-08-27 Method, apparatus, device, and storage medium for recognizing video actions WO2020057329A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP19862600.4A EP3862914A4 (en) 2018-09-21 2019-08-27 VIDEO ACTION RECOGNITION METHOD, APPARATUS AND DEVICE, AND STORAGE MEDIA
US17/278,195 US20220130146A1 (en) 2018-09-21 2019-08-27 Method for recognizing video action, and device and storage medium thereof

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811107097.0 2018-09-21
CN201811107097.0A CN109344755B (zh) 2018-09-21 2018-09-21 Method, apparatus, device, and storage medium for recognizing video actions

Publications (1)

Publication Number Publication Date
WO2020057329A1 true WO2020057329A1 (zh) 2020-03-26

Family

ID=65306546

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/102717 WO2020057329A1 (zh) 2018-09-21 2019-08-27 Method, apparatus, device, and storage medium for recognizing video actions

Country Status (4)

Country Link
US (1) US20220130146A1 (zh)
EP (1) EP3862914A4 (zh)
CN (1) CN109344755B (zh)
WO (1) WO2020057329A1 (zh)


Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344755B (zh) * 2018-09-21 2024-02-13 广州市百果园信息技术有限公司 Method, apparatus, device, and storage medium for recognizing video actions
CN109558832B (zh) 2018-11-27 2021-03-26 广州市百果园信息技术有限公司 Human body posture detection method, apparatus, device, and storage medium
CN110163129B (zh) * 2019-05-08 2024-02-13 腾讯科技(深圳)有限公司 Video processing method and apparatus, electronic device, and computer-readable storage medium
CN110543830B (zh) * 2019-08-12 2022-05-13 珠海格力电器股份有限公司 Motion detection method, apparatus, and storage medium
EP4038541A4 (en) * 2019-09-19 2023-06-28 Arctan Analytics Pte. Ltd. System and method for assessing customer satisfaction from a physical gesture of a customer
CN110866478B (zh) * 2019-11-06 2022-04-29 支付宝(杭州)信息技术有限公司 Method, apparatus, and device for recognizing an object in an image
CN113038149A (zh) * 2019-12-09 2021-06-25 上海幻电信息科技有限公司 Live video interaction method and apparatus, and computer device
CN112883782B (zh) * 2021-01-12 2023-03-24 上海肯汀通讯科技有限公司 Delivery behavior recognition method, apparatus, device, and storage medium
CN113111939B (zh) * 2021-04-12 2022-09-02 中国人民解放军海军航空大学航空作战勤务学院 Aircraft flight maneuver recognition method and apparatus
CN113205067B (zh) * 2021-05-26 2024-04-09 北京京东乾石科技有限公司 Operator monitoring method and apparatus, electronic device, and storage medium
KR102603423B1 (ko) * 2022-06-23 2023-11-20 주식회사 노타 Apparatus and method for classifying an event of an image using a neural network model
WO2023249307A1 (ko) * 2022-06-23 2023-12-28 주식회사 노타 Apparatus and method for determining image event classification using a neural network model
CN117369649B (zh) * 2023-12-05 2024-03-26 山东大学 Proprioception-based virtual reality interaction system and method

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103020648A (zh) * 2013-01-09 2013-04-03 北京东方艾迪普科技发展有限公司 Action type recognition method, program broadcasting method, and apparatus
CN104049760A (zh) * 2014-06-24 2014-09-17 深圳先进技术研究院 Method and system for obtaining human-computer interaction commands
US9600717B1 (en) * 2016-02-25 2017-03-21 Zepp Labs, Inc. Real-time single-view action recognition based on key pose analysis for sports videos
CN107766839A (zh) * 2017-11-09 2018-03-06 清华大学 Neural-network-based action recognition method and apparatus
CN107786549A (zh) * 2017-10-16 2018-03-09 北京旷视科技有限公司 Method, apparatus, and system for adding audio files, and computer-readable medium
CN108181989A (zh) * 2017-12-29 2018-06-19 北京奇虎科技有限公司 Gesture control method and apparatus based on video data, and computing device
CN108229277A (zh) * 2017-03-31 2018-06-29 北京市商汤科技开发有限公司 Gesture recognition, gesture control, and neural network training methods, apparatuses, and electronic device
CN109344755A (zh) * 2018-09-21 2019-02-15 广州市百果园信息技术有限公司 Method, apparatus, device, and storage medium for recognizing video actions

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3607440B2 (ja) * 1996-12-03 2005-01-05 日本電気株式会社 Gesture recognition method
US6683968B1 (en) * 1999-09-16 2004-01-27 Hewlett-Packard Development Company, L.P. Method for visual tracking using switching linear dynamic system models
JP2012068713A (ja) * 2010-09-21 2012-04-05 Sony Corp Information processing apparatus and information processing method
CN103077532A (zh) * 2012-12-24 2013-05-01 天津市亚安科技股份有限公司 Real-time fast video target tracking method
RU2013146529A (ru) * 2013-10-17 2015-04-27 ЭлЭсАй Корпорейшн Dynamic hand gesture recognition with selective initiation based on detected hand speed
US10115005B2 (en) * 2016-08-12 2018-10-30 Qualcomm Incorporated Methods and systems of updating motion models for object trackers in video analytics
CN106780620B (zh) * 2016-11-28 2020-01-24 长安大学 Table tennis ball trajectory recognition, positioning, and tracking system and method
US10262226B1 (en) * 2017-05-16 2019-04-16 State Farm Mutual Automobile Insurance Company Systems and methods regarding 2D image and 3D image ensemble prediction models
CN108241849B (zh) * 2017-08-28 2021-09-07 北方工业大学 Video-based human interaction action recognition method
CN107786848A (zh) * 2017-10-30 2018-03-09 周燕红 Method, apparatus, terminal, and storage medium for moving target detection and action recognition
WO2019120290A1 (zh) * 2017-12-22 2019-06-27 北京市商汤科技开发有限公司 Dynamic gesture recognition method and apparatus, and gesture interaction control method and apparatus


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3862914A4

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113033458A (zh) * 2021-04-09 2021-06-25 京东数字科技控股股份有限公司 Action recognition method and apparatus
CN113033458B (zh) * 2021-04-09 2023-11-07 京东科技控股股份有限公司 Action recognition method and apparatus

Also Published As

Publication number Publication date
CN109344755B (zh) 2024-02-13
US20220130146A1 (en) 2022-04-28
CN109344755A (zh) 2019-02-15
EP3862914A4 (en) 2022-01-19
EP3862914A1 (en) 2021-08-11

Similar Documents

Publication Publication Date Title
  • WO2020057329A1 (zh) Method, apparatus, device, and storage medium for recognizing video actions
US10950271B1 (en) Method for triggering events in a video
  • WO2021115181A1 (zh) Gesture recognition method, gesture control method, apparatus, medium, and terminal device
US10983596B2 (en) Gesture recognition method, device, electronic device, and storage medium
  • WO2019233421A1 (zh) Image processing method and apparatus, electronic device, and storage medium
  • WO2019024808A1 (zh) Training method and apparatus for semantic segmentation model, electronic device, and storage medium
  • WO2022142009A1 (zh) Blurred image correction method and apparatus, computer device, and storage medium
US9355333B2 (en) Pattern recognition based on information integration
US11017253B2 (en) Liveness detection method and apparatus, and storage medium
US20210343042A1 (en) Audio acquisition device positioning method and apparatus, and speaker recognition method and system
US11636666B2 (en) Method and apparatus for identifying key point locations in image, and medium
JP2019506672A (ja) 多重オブジェクト構造を認識するためのシステムおよび方法
  • CN113780326A (zh) Image processing method and apparatus, storage medium, and electronic device
  • CN116012913A (zh) Model training method, face key point detection method, medium, and apparatus
US10766143B1 (en) Voice controlled keyboard typing using computer vision
  • CN111815748B (zh) Animation processing method and apparatus, storage medium, and electronic device
  • CN111126101B (zh) Method and apparatus for determining key point positions, electronic device, and storage medium
  • CN114926322B (zh) Image generation method and apparatus, electronic device, and storage medium
  • CN114461078B (zh) Artificial-intelligence-based human-computer interaction method
  • WO2022194130A1 (zh) Character position correction method and apparatus, electronic device, and storage medium
  • CN110263743B (zh) Method and apparatus for recognizing images
  • CN111353464B (zh) Object detection model training method, and object detection method and apparatus
  • CN111696157B (zh) Method, system, device, and storage medium for determining image repositioning
  • CN116416159A (zh) Image correction method and apparatus, electronic device, and medium
  • CN114942697A (zh) Method, apparatus, and system for recognizing interactive content

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 19862600; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
ENP Entry into the national phase (Ref document number: 2019862600; Country of ref document: EP; Effective date: 20210421)