WO2019091417A1 - Neural network-based recognition method and apparatus - Google Patents

Neural network-based recognition method and apparatus

Info

Publication number
WO2019091417A1
WO2019091417A1 (PCT/CN2018/114487, CN2018114487W)
Authority
WO
WIPO (PCT)
Prior art keywords
action
identified video
frame
neural network
category
Prior art date
Application number
PCT/CN2018/114487
Other languages
English (en)
French (fr)
Inventor
季向阳
吴嘉林
杨武魁
王谷
Original Assignee
清华大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 清华大学
Priority to JP2020524869A (granted as JP6920771B2)
Publication of WO2019091417A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Definitions

  • the present disclosure relates to the field of neural network technologies, and in particular, to a neural network-based identification method and apparatus.
  • in traditional action localization methods, a typical two-dimensional-plus-time action localization framework detects the moving person in each frame, and then links the detected persons across frames to form an action instance.
  • when detecting people, these algorithms can consider only the appearance and motion features within a single frame, which greatly reduces the temporal receptive field of the neural network and makes actions of small amplitude difficult to separate from the background.
  • in addition, because the algorithm proceeds frame by frame, each detection box must pass through the network separately, which leads to a significant increase in computational overhead.
  • moreover, multiple action instances cause the responses in the regressed score maps to overlap, so ordinary three-dimensional action localization methods have difficulty localizing multiple action instances.
  • the present disclosure proposes a neural network based motion recognition method and apparatus for improving the accuracy and detection efficiency of a neural network based motion recognition method.
  • a neural network based motion recognition method comprising:
  • the action extraction result of the to-be-identified video includes: a first probability that each frame of image in the to-be-identified video belongs to one action position in an action instance, and an action start frame and an action end frame in the to-be-identified video.
  • determining, according to the action extraction result of the to-be-identified video, the action instance detection result of the to-be-identified video includes: calculating an action detection box in each frame of image, calculating detection box matching values between frames, and determining an action instance detection box of the to-be-identified video.
  • the action category discrimination result of the to-be-identified video includes: an action category probability corresponding to each pixel on each frame of image.
  • determining the action category of the to-be-identified video according to the action instance detection result and the action category discrimination result includes: determining, among the action category probabilities corresponding to the pixels on each frame of image, the action categories corresponding to the pixels in the action instance detection box, and determining the action category of the to-be-identified video from those categories.
  • inputting the to-be-identified video into the trained first three-dimensional neural network model to obtain the action extraction result includes: inputting the to-be-identified video into a trained two-dimensional neural network model to obtain feature values, and inputting the feature values into the trained first three-dimensional neural network model for processing; likewise, the feature values are input into the trained second three-dimensional neural network model for processing, to obtain the action category discrimination result of the to-be-identified video.
  • a neural network based motion recognition apparatus including:
  • a first three-dimensional identification module configured to input the to-be-identified video into the trained first three-dimensional neural network model for processing, to obtain an action extraction result of the to-be-identified video
  • the action extraction result processing module is configured to determine an action instance detection result of the to-be-identified video according to the action extraction result of the to-be-identified video;
  • a second three-dimensional identification module configured to input the to-be-identified video into the trained second three-dimensional neural network model, to obtain an action category discrimination result of the to-be-identified video
  • the action category determining module is configured to determine an action category of the to-be-identified video according to the action instance detection result and the action category determination result of the to-be-identified video.
  • the action extraction result of the to-be-identified video includes: a first probability that each frame of image in the to-be-identified video belongs to one action position in an action instance, and an action start frame and an action end frame in the to-be-identified video.
  • the action extraction result processing module includes:
  • an action detection box calculation submodule configured to calculate an action detection box in each frame of image according to the first probability of each frame of image in the to-be-identified video, and the action start frame and action end frame in the to-be-identified video;
  • a matching value calculation submodule configured to calculate detection box matching values between frames of image according to the action detection boxes;
  • an action instance determining submodule configured to determine an action instance detection box of the to-be-identified video according to the detection box matching values.
  • the action category discrimination result of the to-be-identified video includes: an action category probability corresponding to each pixel on each frame of image.
  • the action category determining module includes:
  • a first action category determining submodule configured to determine, among the action category probabilities corresponding to the pixels on each frame of image, the action categories corresponding to the pixels in the action instance detection box;
  • a second action category determining submodule configured to determine the action category of the to-be-identified video according to the action categories corresponding to the pixels in the action instance detection box.
  • the first three-dimensional identification module includes:
  • a first two-dimensional identification sub-module configured to input the to-be-identified video into the trained two-dimensional neural network model to obtain a feature value
  • a first three-dimensional identification sub-module configured to input the feature value into the trained first three-dimensional neural network model, to obtain an action extraction result of the to-be-identified video
  • the second three-dimensional identification module includes:
  • a second two-dimensional identification sub-module configured to input the to-be-identified video into the trained two-dimensional neural network model to obtain a feature value
  • a second three-dimensional identification sub-module configured to input the feature value into the trained second three-dimensional neural network model to obtain an action category discrimination result of the to-be-identified video.
  • a neural network based motion recognition apparatus including:
  • a memory for storing processor executable instructions
  • the processor is configured to perform the above neural network based motion recognition method.
  • a non-transitory computer readable storage medium having stored thereon computer program instructions that, when executed by a processor, implement the neural network-based motion recognition method described above.
  • the to-be-identified video is processed with two three-dimensional neural network models, yielding the action extraction result and the action category discrimination result respectively. After the action instances in the to-be-identified video are determined from the action extraction result, they are combined with the action category discrimination result to determine the action category of the to-be-identified video.
  • FIG. 1 shows a flow chart of a neural network based motion recognition method in accordance with an embodiment of the present disclosure
  • FIG. 2 illustrates a flow chart of a neural network based motion recognition method in accordance with an embodiment of the present disclosure
  • FIG. 3 illustrates a flowchart of a neural network based motion recognition method according to an embodiment of the present disclosure
  • FIG. 4 illustrates a flowchart of a neural network based motion recognition method according to an embodiment of the present disclosure
  • FIG. 5 is a schematic diagram of generating a single frame motion detection frame in a neural network based motion recognition method according to an embodiment of the present disclosure
  • FIG. 6 is a schematic diagram of generating an action instance detection frame in a neural network based motion recognition method according to an embodiment of the present disclosure
  • FIG. 7 illustrates a schematic diagram of determining an action category of a video to be identified in a neural network based motion recognition method according to an embodiment of the present disclosure
  • FIG. 8 illustrates a block diagram of a neural network based motion recognition apparatus according to an embodiment of the present disclosure
  • FIG. 9 illustrates a block diagram of a neural network based motion recognition apparatus according to an embodiment of the present disclosure
  • FIG. 10 illustrates a block diagram of a neural network based motion recognition apparatus in accordance with an embodiment of the present disclosure.
  • FIG. 1 is a flowchart of a neural network-based motion recognition method according to an embodiment of the present disclosure.
  • the neural network-based motion recognition method shown in FIG. 1 includes:
  • step S10 the to-be-identified video is input into the trained first three-dimensional neural network model to obtain an action extraction result of the to-be-identified video.
  • the to-be-identified video includes a video composed of consecutive image frames, in which the persons are performing some action category, such as the long jump, playing basketball, singing, and the like.
  • the first three-dimensional neural network model includes a 3D convolutional neural network model, composed of multiple 3D convolution layers and multiple 3D pooling layers, which models the spatial information and temporal information in the to-be-identified video.
  • the spatial information includes the pixels on each frame of image, and the temporal information includes the timing information in the video stream.
  • the action extraction result includes the decomposed-action features extracted from the to-be-identified video.
  • Step S20 Determine an action instance detection result of the to-be-identified video according to the action extraction result of the to-be-identified video.
  • an action instance includes a decomposed action with temporal features, and multiple action instances make up an action category.
  • for example, the long jump includes three action instances: the run-up, the take-off, and the fall. According to the decomposed-action features in the action extraction result, each action instance in the to-be-identified video is determined.
  • step S30 the to-be-identified video is input into the trained second three-dimensional neural network model to obtain an action category discrimination result of the to-be-identified video.
  • the second three-dimensional neural network model includes a 3D convolutional neural network model, composed of multiple 3D convolution layers and multiple 3D pooling layers, which models the spatial information and temporal information in the to-be-identified video.
  • the spatial information includes the pixels on each frame of image, and the temporal information includes the timing information in the video stream.
  • the action category discrimination result includes the action category features extracted from the to-be-identified video.
  • step S30 can be performed simultaneously with step S10, before step S10, or after step S10.
  • Step S40 Determine an action category of the to-be-identified video according to the action instance detection result of the to-be-identified video and the action category determination result.
  • in this embodiment, two three-dimensional neural network models are used to process the to-be-identified video, yielding the action extraction result and the action category discrimination result respectively. After the action instances in the to-be-identified video are determined from the action extraction result, they are combined with the action category discrimination result to determine the action category of the to-be-identified video.
  • FIG. 2 shows a flowchart of a neural network based motion recognition method according to an embodiment of the present disclosure, as shown in FIG. 2, based on the embodiment shown in FIG.
  • the action extraction result of the to-be-identified video in step S10 includes: a first probability that each frame of image in the to-be-identified video belongs to one action position in an action instance, and the action start frame and action end frame in the to-be-identified video.
  • the action start frame includes the start frame of an action instance and also the start frame of an action category.
  • the action end frame includes the end frame of an action instance and also the end frame of an action category.
  • an action instance is a decomposed action that continues in time.
  • each decomposed action has a plurality of consecutive action positions; extracting the action positions with distinct features in each decomposed action allows more accurate action instances to be obtained in the subsequent analysis.
  • for example, the take-off action instance in the long jump category includes at least five action positions: the feet leaving the ground, jumping up, reaching the highest point, falling, and both feet landing.
  • the sample videos used in training the first three-dimensional neural network model are annotated with the action start frame and the action end frame, and with labels identifying that the action in each image belongs to a preset action position in a preset action instance.
  • the action extraction result obtained with the trained first three-dimensional neural network model includes: the action start frame and action end frame in the to-be-identified video, and the first probability that each frame of image in the to-be-identified video belongs to one action position in an action instance.
  • for example, the to-be-identified video includes 20 frames of images, where the 6th frame is the action start frame and the 20th frame is the action end frame; the probability that the 6th frame is the take-off is 60%, the probability that the 12th frame is the highest point is 70%, and so on.
  • Step S20 includes:
  • Step S21 Calculate an action detection box in each frame of image according to the first probability of each frame of image in the to-be-identified video, and the action start frame and action end frame in the to-be-identified video.
  • FIG. 5 is a schematic diagram of generating a single-frame action detection box in a neural network based motion recognition method according to an embodiment of the present disclosure. As shown in FIG. 5, according to the first probability of each frame of image, the range of pixels inferred to contain the action position in that frame is obtained, that is, the action detection box. For example, in the image on the right side of FIG. 5, since the detection box mainly contains pixels of the arm, the action position can be inferred to be raising a hand.
  • Step S22 Calculate detection box matching values between frames of image according to the action detection boxes.
  • the detection box matching values between frames may be calculated according to the probabilities of the action positions in the action detection boxes of each frame of image.
  • for example, the matching value between the action in the detection box of the last frame of action instance 1 and the action in the detection box of the second-to-last frame of action instance 1 is A; the matching value between the action in the detection box of the last frame of action instance 1 and the action in the detection box of the first frame of action instance 2 is B.
  • Step S23 Determine an action instance detection box of the to-be-identified video according to the detection box matching values.
  • the actions in the detection boxes of frames belonging to the same action instance are more strongly correlated, and it is easy to understand that the degree of matching between actions of different action instances is low. Therefore, the matching value A in the above example is larger than B.
  • the action instance detection box of each action instance is determined according to the action detection boxes within that action instance.
  • FIG. 6 is a schematic diagram of generating an action instance detection frame in a neural network-based motion recognition method according to an embodiment of the present disclosure.
  • as shown in FIG. 6, the four frames of images on the left belong to one action instance 1.
  • according to the action detection boxes in these four frames, the action instance detection box on the right side is determined; the action instance detection box contains the action detection boxes of all frames in the action instance.
  • the action extraction result given by the first three-dimensional neural network model includes the probability of the action position of the action instance to which each frame of the to-be-identified video belongs, which enhances the ability to distinguish different action instances and makes the subsequent determination of the action category more accurate.
  • according to the action extraction result of the first three-dimensional neural network model, each action instance detection box in the to-be-identified video is determined. Once each action instance in the to-be-identified video has been determined, the localization accuracy of the action category can be improved in the subsequent recognition of the action category.
  • FIG. 3 is a flowchart of a motion recognition method based on a neural network according to an embodiment of the present disclosure.
  • the second three-dimensional neural network model directly gives the action category probability corresponding to each pixel on each frame of image in the to-be-identified video.
  • for example, for pixel 1 in the first frame of image, the probability that the corresponding action category is singing is 0.3, the probability that it is running is 0.5, and the probability that it is kicking a ball is 0.2.
  • for pixel 2, the probability that the corresponding action category is kicking a ball is 0.1, the probability that it is running is 0.1, and the probability that it is singing is 0.8.
  • Step S40 includes:
  • Step S41 Determine, among the action category probabilities corresponding to the pixels on each frame of image, the action categories corresponding to the pixels in the action instance detection box.
  • Step S42 Determine the action category of the to-be-identified video according to the action categories corresponding to the pixels in the action instance detection box.
  • FIG. 7 is a schematic diagram of determining an action category of a video to be recognized in a neural network based motion recognition method according to an embodiment of the present disclosure.
  • as shown in FIG. 7, the small cube at the upper left is an action instance detection box obtained from the processing result of the first three-dimensional neural network model; the spatial extent of the action instance detection box is a cuboid bounded by the action start frame, the action end frame, and the action detection boxes, and it determines the range of values used to decide the action category in the to-be-identified video.
  • the large cube at the lower left is the video information after processing by the second three-dimensional neural network model, in which every pixel carries an action category probability.
  • the decision range for the action category is determined by placing the small upper-left cube within the large lower-left cube, and finally the sum of the action category probabilities of the pixels in the small cube on the right is obtained.
  • the action category with the highest probability among the action categories of the small cube on the right is determined as the action category of the to-be-identified video.
  • through the action category discrimination result given by the second three-dimensional neural network model, the probability of the action category corresponding to each pixel on each frame of image can be obtained; since the action category is judged for every pixel, the action category recognition result of the whole to-be-identified video is more accurate. Moreover, the two three-dimensional neural network models model the temporal information and the spatial information simultaneously, making the localization of actions more robust; the extracted action instance detection boxes avoid the burden of computing the action category features of every frame of image one by one, reducing the computation of action recognition; and modeling the start of the action, the end of the action, and the action at specific action positions enhances the ability to distinguish different action instances, making the action recognition result more accurate.
  • FIG. 4 shows a flowchart of a neural network based motion recognition method according to an embodiment of the present disclosure, as shown in FIG. 4, based on the embodiment shown in FIG.
  • Step S10 comprising:
  • Step 101 Input the to-be-identified video into the trained two-dimensional neural network model to obtain a feature value.
  • Step 102 The feature value is input into the trained first three-dimensional neural network model for processing, and the action extraction result of the to-be-identified video is obtained.
  • Step S30 comprising:
  • Step 301 Input the to-be-identified video into the trained two-dimensional neural network model to obtain a feature value.
  • Step 302 The feature value is input into the trained second three-dimensional neural network model for processing, and the action category discrimination result of the to-be-identified video is obtained.
  • the to-be-identified video is input into a two-dimensional neural network model for processing, yielding a more generalized feature expression; the extracted feature values are then input into the first three-dimensional neural network model and the second three-dimensional neural network model respectively for processing.
  • by first inputting the to-be-identified video into the two-dimensional neural network model and extracting features, the processing efficiency of the three-dimensional neural network models can be improved, thereby improving the efficiency of determining the action category of the to-be-identified video.
  • FIG. 8 is a block diagram of a neural network-based motion recognition apparatus according to an embodiment of the present disclosure. As shown in FIG. 8, the neural network-based motion recognition apparatus provided in this embodiment includes:
  • the first three-dimensional identification module 41 is configured to input the to-be-identified video into the trained first three-dimensional neural network model for processing, to obtain an action extraction result of the to-be-identified video;
  • the action extraction result processing module 42 is configured to determine an action instance detection result of the to-be-identified video according to the action extraction result of the to-be-identified video;
  • the second three-dimensional identification module 43 is configured to process the to-be-identified video into the trained second three-dimensional neural network model to obtain an action category discrimination result of the to-be-identified video;
  • the action category determining module 44 is configured to determine an action category of the to-be-identified video according to the action instance detection result of the to-be-identified video and the action category determination result of the to-be-identified video.
  • FIG. 9 is a block diagram showing a neural network based motion recognition apparatus according to an embodiment of the present disclosure. As shown in FIG. 9, on the basis of the apparatus shown in FIG.
  • the action extraction result of the to-be-identified video includes: a first probability that each frame of image in the to-be-identified video belongs to one action position in an action instance, and the action start frame and action end frame in the to-be-identified video.
  • the action extraction result processing module 42 includes:
  • the action detection box calculation sub-module 421 is configured to calculate an action detection box in each frame of image according to the first probability of each frame of image in the to-be-identified video, and the action start frame and action end frame in the to-be-identified video;
  • the matching value calculation sub-module 422 is configured to calculate detection box matching values between frames of image according to the action detection boxes;
  • the action instance determining sub-module 423 is configured to determine an action instance detection box of the to-be-identified video according to the detection box matching values.
  • the action category discrimination result of the to-be-identified video includes: an action category probability corresponding to each pixel on each frame of image.
  • the action category determining module 44 includes:
  • a first action category determining sub-module 441, configured to determine, among the action category probabilities corresponding to the pixels on each frame of image, the action categories corresponding to the pixels in the action instance detection box;
  • the second action category determining sub-module 442 is configured to determine the action category of the to-be-identified video according to the action categories corresponding to the pixels in the action instance detection box.
  • the first three-dimensional identification module 41 includes:
  • a first two-dimensional identification sub-module 411 configured to input the to-be-identified video into the trained two-dimensional neural network model to obtain a feature value
  • a first three-dimensional identification sub-module 412 configured to input the feature value into the trained first three-dimensional neural network model, to obtain an action extraction result of the to-be-identified video
  • the second three-dimensional identification module 43 includes:
  • a second two-dimensional identification sub-module 431, configured to input the to-be-identified video into the trained two-dimensional neural network model to obtain a feature value
  • the second three-dimensional recognition sub-module 432 is configured to input the feature value into the trained second three-dimensional neural network model to obtain an action category discrimination result of the to-be-identified video.
  • FIG. 10 is a block diagram of a neural network based motion recognition apparatus 1900, according to an exemplary embodiment.
  • device 1900 can be provided as a server.
  • apparatus 1900 includes a processing component 1922 that further includes one or more processors, and memory resources represented by memory 1932 for storing instructions executable by processing component 1922, such as an application.
  • An application stored in memory 1932 can include one or more modules each corresponding to a set of instructions.
  • processing component 1922 is configured to execute instructions to perform the methods described above.
  • Apparatus 1900 can also include a power supply component 1926 configured to perform power management of apparatus 1900, a wired or wireless network interface 1950 configured to connect apparatus 1900 to a network, and an input/output (I/O) interface 1958.
  • Device 1900 can operate based on an operating system stored in memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
  • a non-transitory computer readable storage medium such as a memory 1932 comprising computer program instructions executable by processing component 1922 of apparatus 1900 to perform the above method.
  • the present disclosure can be a system, method, and/or computer program product.
  • the computer program product can comprise a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.
  • the computer readable storage medium can be a tangible device that can hold and store the instructions used by the instruction execution device.
  • the computer readable storage medium can be, for example, but not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • a more specific (non-exhaustive) list of computer readable storage media includes: a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile disc (DVD), memory stick, floppy disk, mechanical encoding device such as a punch card or a raised structure in a groove with instructions stored thereon, and any suitable combination of the foregoing.
  • a computer readable storage medium as used herein is not to be interpreted as a transient signal itself, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (for example, a light pulse through a fiber optic cable), or an electrical signal transmitted through a wire.
  • the computer readable program instructions described herein can be downloaded from a computer readable storage medium to various computing/processing devices or downloaded to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network.
  • the network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
  • a network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards them for storage in a computer readable storage medium in the respective computing/processing device.
  • Computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the computer readable program instructions can execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • where a remote computer is involved, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or wide area network (WAN), or can be connected to an external computer (for example, through the Internet using an Internet service provider).
  • in some embodiments, an electronic circuit, such as a programmable logic circuit, a field programmable gate array (FPGA), or a programmable logic array (PLA), can be personalized by utilizing the state information of the computer readable program instructions, and the electronic circuit can execute the computer readable program instructions to implement various aspects of the present disclosure.
  • these computer readable program instructions can be provided to a processor of a general purpose computer, a special purpose computer, or another programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • these computer readable program instructions can also be stored in a computer readable storage medium that causes the computer, programmable data processing apparatus, and/or other devices to operate in a particular manner, such that the computer readable medium storing the instructions includes an article of manufacture that includes instructions implementing various aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • the computer readable program instructions can also be loaded onto a computer, another programmable data processing apparatus, or another device, so that a series of operational steps are performed on the computer, the other programmable apparatus, or the other device to produce a computer-implemented process, such that the instructions executed on the computer, the other programmable apparatus, or the other device implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • each block in the flowcharts or block diagrams can represent a module, a program segment, or a portion of instructions, which contains one or more executable instructions for implementing the specified logical functions.
  • in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two consecutive blocks may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or acts, or by a combination of dedicated hardware and computer instructions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Computational Linguistics (AREA)
  • Psychiatry (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to a neural network-based recognition method and apparatus. The method includes: inputting a to-be-identified video into a trained first three-dimensional neural network model for processing, to obtain an action extraction result of the to-be-identified video; determining an action instance detection result of the to-be-identified video according to the action extraction result of the to-be-identified video; inputting the to-be-identified video into a trained second three-dimensional neural network model for processing, to obtain an action category discrimination result of the to-be-identified video; and determining an action category of the to-be-identified video according to the action instance detection result and the action category discrimination result of the to-be-identified video. Combining the different recognition results obtained by the two three-dimensional neural network models improves the recognition efficiency of the three-dimensional neural network models and reduces the computation required of a single three-dimensional neural network model.

Description

Neural network-based recognition method and apparatus

TECHNICAL FIELD

The present disclosure relates to the field of neural network technologies, and in particular to a neural network-based recognition method and apparatus.

BACKGROUND

Action localization generally falls into two kinds: localization in space only, and simultaneous localization in space and time. In a long video in which multiple performers act at the same time, different action instances influence and overlap one another. Because what a neural network learns is a generalized expression of a category, traditional neural-network-based localization methods have difficulty distinguishing these mutually overlapping actions.

Among traditional action localization methods, a typical two-dimensional-plus-time action localization framework detects the moving person in each frame and then links the detected persons across frames to form an action instance. When detecting people, these algorithms can consider only the appearance and motion features within a single frame, which greatly reduces the temporal receptive field of the neural network and makes actions of small amplitude difficult to separate from the background. In addition, because the algorithm proceeds frame by frame, each detection box must pass through the network separately when it is evaluated, which greatly increases the computational cost. Furthermore, multiple action instances cause the responses in the regressed score maps to overlap, so ordinary three-dimensional action localization methods have difficulty localizing multiple action instances.
SUMMARY

In view of this, the present disclosure proposes a neural network-based action recognition method and apparatus, for improving the accuracy and detection efficiency of neural network-based action recognition.

According to an aspect of the present disclosure, a neural network-based action recognition method is provided, the method including:

inputting a to-be-identified video into a trained first three-dimensional neural network model for processing, to obtain an action extraction result of the to-be-identified video;

determining an action instance detection result of the to-be-identified video according to the action extraction result of the to-be-identified video;

inputting the to-be-identified video into a trained second three-dimensional neural network model for processing, to obtain an action category discrimination result of the to-be-identified video;

determining an action category of the to-be-identified video according to the action instance detection result and the action category discrimination result of the to-be-identified video.

In a possible implementation, the action extraction result of the to-be-identified video includes:

a first probability that each frame of image in the to-be-identified video belongs to one action position in an action instance, and

the action start frame and the action end frame in the to-be-identified video.

In a possible implementation, determining the action instance detection result of the to-be-identified video according to the action extraction result of the to-be-identified video includes:

calculating an action detection box in each frame of image according to the first probability of each frame of image in the to-be-identified video, and the action start frame and action end frame in the to-be-identified video;

calculating detection box matching values between frames of image according to the action detection boxes;

determining an action instance detection box of the to-be-identified video according to the detection box matching values.

In a possible implementation, the action category discrimination result of the to-be-identified video includes:

an action category probability corresponding to each pixel on each frame of image.

In a possible implementation, determining the action category of the to-be-identified video according to the action instance detection result and the action category discrimination result includes:

determining, among the action category probabilities corresponding to the pixels on each frame of image, the action categories corresponding to the pixels in the action instance detection box;

determining the action category of the to-be-identified video according to the action categories corresponding to the pixels in the action instance detection box.

In a possible implementation, inputting the to-be-identified video into the trained first three-dimensional neural network model for processing, to obtain the action extraction result of the to-be-identified video, includes:

inputting the to-be-identified video into a trained two-dimensional neural network model to obtain feature values;

inputting the feature values into the trained first three-dimensional neural network model for processing, to obtain the action extraction result of the to-be-identified video;

and inputting the to-be-identified video into the trained second three-dimensional neural network model for processing, to obtain the action category discrimination result of the to-be-identified video, includes:

inputting the to-be-identified video into the trained two-dimensional neural network model to obtain feature values;

inputting the feature values into the trained second three-dimensional neural network model for processing, to obtain the action category discrimination result of the to-be-identified video.
According to another aspect of the present disclosure, a neural network-based action recognition apparatus is provided, including:

a first three-dimensional identification module, configured to input a to-be-identified video into a trained first three-dimensional neural network model for processing, to obtain an action extraction result of the to-be-identified video;

an action extraction result processing module, configured to determine an action instance detection result of the to-be-identified video according to the action extraction result of the to-be-identified video;

a second three-dimensional identification module, configured to input the to-be-identified video into a trained second three-dimensional neural network model for processing, to obtain an action category discrimination result of the to-be-identified video;

an action category determining module, configured to determine an action category of the to-be-identified video according to the action instance detection result and the action category discrimination result of the to-be-identified video.

In a possible implementation, the action extraction result of the to-be-identified video includes:

a first probability that each frame of image in the to-be-identified video belongs to one action position in an action instance, and

the action start frame and the action end frame in the to-be-identified video.

In a possible implementation, the action extraction result processing module includes:

an action detection box calculation sub-module, configured to calculate an action detection box in each frame of image according to the first probability of each frame of image in the to-be-identified video, and the action start frame and action end frame in the to-be-identified video;

a matching value calculation sub-module, configured to calculate detection box matching values between frames of image according to the action detection boxes;

an action instance determining sub-module, configured to determine an action instance detection box of the to-be-identified video according to the detection box matching values.

In a possible implementation, the action category discrimination result of the to-be-identified video includes:

an action category probability corresponding to each pixel on each frame of image.

In a possible implementation, the action category determining module includes:

a first action category determining sub-module, configured to determine, among the action category probabilities corresponding to the pixels on each frame of image, the action categories corresponding to the pixels in the action instance detection box;

a second action category determining sub-module, configured to determine the action category of the to-be-identified video according to the action categories corresponding to the pixels in the action instance detection box.

In a possible implementation, the first three-dimensional identification module includes:

a first two-dimensional identification sub-module, configured to input the to-be-identified video into a trained two-dimensional neural network model to obtain feature values;

a first three-dimensional identification sub-module, configured to input the feature values into the trained first three-dimensional neural network model for processing, to obtain the action extraction result of the to-be-identified video;

and the second three-dimensional identification module includes:

a second two-dimensional identification sub-module, configured to input the to-be-identified video into the trained two-dimensional neural network model to obtain feature values;

a second three-dimensional identification sub-module, configured to input the feature values into the trained second three-dimensional neural network model for processing, to obtain the action category discrimination result of the to-be-identified video.

According to another aspect of the present disclosure, a neural network-based action recognition apparatus is provided, including:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to perform the above neural network-based action recognition method.

According to another aspect of the present disclosure, a non-transitory computer readable storage medium is provided, having computer program instructions stored thereon, where the computer program instructions, when executed by a processor, implement the above neural network-based action recognition method.

According to the embodiments of the present disclosure, two three-dimensional neural network models are used to process the to-be-identified video, yielding the action extraction result and the action category discrimination result respectively. After the action instances in the to-be-identified video are determined from the action extraction result, they are combined with the action category discrimination result to determine the action category of the to-be-identified video. Combining the different recognition results obtained by the two three-dimensional neural network models improves the recognition efficiency of the three-dimensional neural network models and reduces the computation required of a single three-dimensional neural network model.

Other features and aspects of the present disclosure will become clear from the following detailed description of exemplary embodiments with reference to the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate exemplary embodiments, features, and aspects of the present disclosure together with the specification, and serve to explain the principles of the present disclosure.

FIG. 1 shows a flowchart of a neural network-based action recognition method according to an embodiment of the present disclosure;

FIG. 2 shows a flowchart of a neural network-based action recognition method according to an embodiment of the present disclosure;

FIG. 3 shows a flowchart of a neural network-based action recognition method according to an embodiment of the present disclosure;

FIG. 4 shows a flowchart of a neural network-based action recognition method according to an embodiment of the present disclosure;

FIG. 5 shows a schematic diagram of generating a single-frame action detection box in a neural network-based action recognition method according to an embodiment of the present disclosure;

FIG. 6 shows a schematic diagram of generating an action instance detection box in a neural network-based action recognition method according to an embodiment of the present disclosure;

FIG. 7 shows a schematic diagram of determining the action category of a to-be-identified video in a neural network-based action recognition method according to an embodiment of the present disclosure;

FIG. 8 shows a block diagram of a neural network-based action recognition apparatus according to an embodiment of the present disclosure;

FIG. 9 shows a block diagram of a neural network-based action recognition apparatus according to an embodiment of the present disclosure;

FIG. 10 shows a block diagram of a neural network-based action recognition apparatus according to an embodiment of the present disclosure.

DETAILED DESCRIPTION

Various exemplary embodiments, features, and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. The same reference numerals in the drawings denote elements with the same or similar functions. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless specifically noted.

The word "exemplary" used herein means "serving as an example, embodiment, or illustration". Any embodiment described herein as "exemplary" is not necessarily to be construed as superior to or better than other embodiments.

In addition, numerous specific details are given in the following detailed description in order to better illustrate the present disclosure. Those skilled in the art will understand that the present disclosure can likewise be practiced without certain specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art are not described in detail, so as to highlight the gist of the present disclosure.
FIG. 1 shows a flowchart of a neural network-based action recognition method according to an embodiment of the present disclosure. The neural network-based action recognition method shown in FIG. 1 includes:

Step S10: inputting a to-be-identified video into a trained first three-dimensional neural network model for processing, to obtain an action extraction result of the to-be-identified video.

In a possible implementation, the to-be-identified video includes a video composed of consecutive image frames, in which the persons are performing some action category, such as the long jump, playing basketball, or singing.

The first three-dimensional neural network model includes a 3D convolutional neural network model, composed of multiple 3D convolution layers and multiple 3D pooling layers, which models the spatial information and temporal information in the to-be-identified video. The spatial information includes the pixels on each frame of image, and the temporal information includes the timing information in the video stream. The action extraction result includes the decomposed-action features extracted from the to-be-identified video.
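As an illustration only, a stack of 3D convolution and 3D pooling layers of the kind described above might be sketched in PyTorch as follows; the disclosure does not specify layer counts, kernel sizes, or a framework, so every architectural choice below is an assumption:

```python
import torch
import torch.nn as nn

class Simple3DCNN(nn.Module):
    """Illustrative 3D conv/pool stack; all sizes are assumptions,
    not taken from the disclosure."""
    def __init__(self, in_channels=3, num_outputs=64):
        super().__init__()
        self.features = nn.Sequential(
            # 3D convolutions see (time, height, width) jointly,
            # modeling spatial and temporal information together.
            nn.Conv3d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),   # pool space only
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(2, 2, 2)),   # pool time and space
        )
        self.head = nn.Conv3d(64, num_outputs, kernel_size=1)

    def forward(self, x):
        # x: (batch, channels, frames, height, width)
        return self.head(self.features(x))

clip = torch.randn(1, 3, 16, 112, 112)   # a 16-frame clip
out = Simple3DCNN()(clip)
print(out.shape)                          # torch.Size([1, 64, 8, 28, 28])
```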
Step S20: determining an action instance detection result of the to-be-identified video according to the action extraction result of the to-be-identified video.

In a possible implementation, an action instance includes a decomposed action with temporal features, and multiple action instances make up an action category. For example, within an action category, the long jump includes three action instances: the run-up, the take-off, and the fall. According to the decomposed-action features in the action extraction result, each action instance in the to-be-identified video is determined.

Step S30: inputting the to-be-identified video into a trained second three-dimensional neural network model for processing, to obtain an action category discrimination result of the to-be-identified video.

In a possible implementation, the second three-dimensional neural network model includes a 3D convolutional neural network model, composed of multiple 3D convolution layers and multiple 3D pooling layers, which models the spatial information and temporal information in the to-be-identified video. The spatial information includes the pixels on each frame of image, and the temporal information includes the timing information in the video stream. The action category discrimination result includes the action category features extracted from the to-be-identified video.

It can be understood that step S30 may be performed simultaneously with step S10, before step S10, or after step S10.

Step S40: determining an action category of the to-be-identified video according to the action instance detection result and the action category discrimination result of the to-be-identified video.

In this embodiment, two three-dimensional neural network models are used to process the to-be-identified video, yielding the action extraction result and the action category discrimination result respectively. After the action instances in the to-be-identified video are determined from the action extraction result, they are combined with the action category discrimination result to determine the action category of the to-be-identified video. Combining the different recognition results obtained by the two three-dimensional neural network models improves the recognition efficiency of the three-dimensional neural network models and reduces the computation required of a single three-dimensional neural network model.
FIG. 2 shows a flowchart of a neural network-based action recognition method according to an embodiment of the present disclosure. In the method shown in FIG. 2, building on the embodiment shown in FIG. 1:

The action extraction result of the to-be-identified video in step S10 includes: a first probability that each frame of image in the to-be-identified video belongs to one action position in an action instance, and the action start frame and the action end frame in the to-be-identified video.

In this implementation, the action start frame includes the start frame of an action instance and also the start frame of an action category; the action end frame includes the end frame of an action instance and also the end frame of an action category.

An action instance is a decomposed action that continues in time. Each decomposed action has a plurality of consecutive action positions, and extracting the action positions with distinct features in each decomposed action allows more accurate action instances to be obtained in the subsequent analysis. For example, the take-off action instance in the long jump category includes at least five action positions: the feet leaving the ground, jumping up, reaching the highest point, falling, and both feet landing.

The sample videos used in training the first three-dimensional neural network model are annotated with the action start frame and the action end frame, and with labels identifying that the action in each image belongs to a preset action position in a preset action instance. The action extraction result obtained after processing with the trained first three-dimensional neural network model includes: the action start frame and action end frame in the to-be-identified video, and the first probability that each frame of image in the to-be-identified video belongs to one action position in an action instance. For example, to-be-identified video 1 includes 20 frames of images, where the 6th frame is the action start frame and the 20th frame is the action end frame; the probability that the 6th frame is the take-off is 60%, the probability that the 12th frame is the highest point is 70%, and so on.
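For concreteness, the action extraction result of the 20-frame example above could be held in a structure like the following; the field and position names are hypothetical, since the disclosure specifies only the information content:

```python
from dataclasses import dataclass, field

@dataclass
class ActionExtractionResult:
    # Hypothetical container mirroring the information the first 3D
    # network outputs: start/end frames plus, for each frame, the
    # first probability of each action position.
    start_frame: int
    end_frame: int
    # first_prob[frame_index][action_position] -> probability
    first_prob: dict = field(default_factory=dict)

result = ActionExtractionResult(
    start_frame=6,
    end_frame=20,
    first_prob={
        6: {"take-off": 0.60},
        12: {"highest point": 0.70},
    },
)
```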
Step S20 includes:

Step S21: calculating an action detection box in each frame of image according to the first probability of each frame of image in the to-be-identified video, and the action start frame and action end frame in the to-be-identified video.

In a possible implementation, FIG. 5 shows a schematic diagram of generating a single-frame action detection box in a neural network-based action recognition method according to an embodiment of the present disclosure. As shown in FIG. 5, according to the first probability of each frame of image, the range of pixels inferred to contain the action position in that frame is obtained, that is, the action detection box. For example, in the image on the right side of FIG. 5, since the action detection box mainly contains pixels of the arm, the action position can be inferred to be raising a hand.
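The disclosure does not give the rule for turning the first probability into a box; one plausible sketch is to threshold a per-pixel probability map and take the bounding box of the surviving pixels:

```python
import numpy as np

def box_from_prob_map(prob_map, threshold=0.5):
    """Bounding box (x1, y1, x2, y2) around pixels whose action-position
    probability exceeds the threshold; returns None if no pixel does.
    The thresholding rule is an assumption, not from the disclosure."""
    ys, xs = np.nonzero(prob_map > threshold)
    if len(xs) == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

prob_map = np.zeros((112, 112))
prob_map[30:60, 40:70] = 0.8          # e.g. pixels covering a raised arm
print(box_from_prob_map(prob_map))    # (40, 30, 69, 59)
```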
Step S22: calculating detection box matching values between frames of image according to the action detection boxes.

In a possible implementation, the detection box matching values between frames may be calculated according to the probabilities of the action positions in the action detection boxes of each frame of image. For example, the matching value between the action in the detection box of the last frame of action instance 1 and the action in the detection box of the second-to-last frame of action instance 1 is A; the matching value between the action in the detection box of the last frame of action instance 1 and the action in the detection box of the first frame of action instance 2 is B.
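The matching-value formula is not defined here; a common stand-in for scoring whether two boxes in neighboring frames belong to the same instance is their spatial overlap (IoU), sketched below under that assumption:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes, used as
    an assumed stand-in for the detection box matching value."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

# Boxes within one instance overlap heavily (matching value A)...
print(iou((40, 30, 70, 60), (42, 31, 72, 62)))   # high, about 0.8
# ...boxes from different instances barely overlap (matching value B).
print(iou((40, 30, 70, 60), (90, 80, 110, 100))) # 0.0
```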
Step S23: determining an action instance detection box of the to-be-identified video according to the detection box matching values.

In a possible implementation, the actions in the detection boxes of frames belonging to the same action instance are more strongly correlated, and it is easy to understand that the degree of matching between actions of different action instances is low. Therefore, the matching value A in the above example is larger than B. According to the action detection boxes within each action instance, the action instance detection box of that action instance is determined.

FIG. 6 shows a schematic diagram of generating an action instance detection box in a neural network-based action recognition method according to an embodiment of the present disclosure. As shown in FIG. 6, the four frames of images on the left all belong to one action instance 1. According to the action detection boxes in these four frames, the action instance detection box on the right side is determined; the action instance detection box contains the action detection boxes of all frames in the action instance.
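Continuing the same assumption (and reusing the iou function above), frame-level boxes could be chained into an action instance greedily, starting a new instance whenever the matching value drops; the threshold and greedy strategy below are illustrative only:

```python
def link_instances(frame_boxes, match_fn, min_match=0.3):
    """Greedily chain per-frame boxes into instances: a box joins the
    current instance while its matching value with the previous box
    stays high, otherwise a new instance starts."""
    instances, current = [], [frame_boxes[0]]
    for box in frame_boxes[1:]:
        if match_fn(current[-1], box) >= min_match:
            current.append(box)
        else:
            instances.append(current)
            current = [box]
    instances.append(current)
    return instances

boxes = [(40, 30, 70, 60), (42, 31, 72, 62), (44, 33, 74, 63),
         (90, 80, 110, 100)]                # last box: a new instance
print(len(link_instances(boxes, iou)))      # 2 instances
```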
In this embodiment, the action extraction result given by the first three-dimensional neural network model includes the probability of the action position of the action instance to which each frame of the to-be-identified video belongs, which enhances the ability to distinguish different action instances and makes the subsequent determination of the action category more accurate. According to the action extraction result of the first three-dimensional neural network model, each action instance detection box in the to-be-identified video is determined. Once each action instance in the to-be-identified video has been determined, the localization accuracy of the action category can be improved in the subsequent recognition of the action category.
FIG. 3 shows a flowchart of a neural network-based action recognition method according to an embodiment of the present disclosure. In the method shown in FIG. 3, building on the above embodiments, the action category discrimination result of the to-be-identified video in step S30 includes: an action category probability corresponding to each pixel on each frame of image.

In this embodiment, the second three-dimensional neural network model directly gives the action category probability corresponding to each pixel on each frame of image in the to-be-identified video. For example, for pixel 1 in the first frame of image, the probability that the corresponding action category is singing is 0.3, the probability that it is running is 0.5, and the probability that it is kicking a ball is 0.2; for pixel 2, the probability that the corresponding action category is kicking a ball is 0.1, the probability that it is running is 0.1, and the probability that it is singing is 0.8.
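In tensor form, such a per-pixel discrimination result is simply a class-probability map; the sizes and class names below are hypothetical:

```python
import torch

logits = torch.randn(1, 3, 20, 112, 112)   # (batch, classes, frames, H, W)
probs = torch.softmax(logits, dim=1)       # probabilities over the classes
classes = ["singing", "running", "kicking"]

# Action category probabilities for the top-left pixel of the first frame:
pixel_probs = probs[0, :, 0, 0, 0]
print(dict(zip(classes, pixel_probs.tolist())))
print(float(pixel_probs.sum()))            # 1.0: a distribution per pixel
```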
Step S40 includes:

Step S41: determining, among the action category probabilities corresponding to the pixels on each frame of image, the action categories corresponding to the pixels in the action instance detection box.

Step S42: determining the action category of the to-be-identified video according to the action categories corresponding to the pixels in the action instance detection box.

FIG. 7 shows a schematic diagram of determining the action category of a to-be-identified video in a neural network-based action recognition method according to an embodiment of the present disclosure. As shown in FIG. 7, the small cube at the upper left is an action instance detection box obtained from the processing result of the first three-dimensional neural network model; the spatial extent of the action instance detection box is a cuboid bounded by the action start frame, the action end frame, and the action detection boxes, and it determines the range of values used to decide the action category in the to-be-identified video.

The large cube at the lower left is the video information after processing by the second three-dimensional neural network model, in which every pixel carries an action category probability. The decision range for the action category is determined by placing the small upper-left cube within the large lower-left cube, and finally the sum of the action category probabilities of the pixels in the small cube on the right is obtained. The action category with the highest probability among the action categories of the small cube on the right is determined as the action category of the to-be-identified video.
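A sketch of this aggregation step, reusing the probs tensor and classes list above and assuming a single instance cube given by its frame range and detection box:

```python
def classify_instance(probs, start_f, end_f, box):
    """Sum the per-pixel class probabilities inside the instance cube
    (frame range x detection box) and pick the most likely class.
    The tensor layout follows the earlier sketches."""
    x1, y1, x2, y2 = box
    cube = probs[0, :, start_f:end_f + 1, y1:y2 + 1, x1:x2 + 1]
    class_scores = cube.sum(dim=(1, 2, 3))   # sum over frames, H, W
    return int(class_scores.argmax())

best = classify_instance(probs, 6, 19, (40, 30, 70, 60))
print(classes[best])   # action category assigned to the whole video
```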
In this embodiment, through the action category discrimination result given by the second three-dimensional neural network model, the probability of the action category corresponding to each pixel on each frame of image can be obtained; since the action category is judged for every pixel, the action category recognition result of the whole to-be-identified video is more accurate. Moreover, the two three-dimensional neural network models model the temporal information and the spatial information simultaneously, making the localization of actions more robust; the extracted action instance detection boxes avoid the burden of computing the action category features of every frame of image one by one, reducing the computation of action recognition; and modeling the start of the action, the end of the action, and the action at specific action positions enhances the ability to distinguish different action instances, making the action recognition result more accurate.
FIG. 4 shows a flowchart of a neural network-based action recognition method according to an embodiment of the present disclosure. In the method shown in FIG. 4, building on the embodiment shown in FIG. 1:

Step S10 includes:

Step 101: inputting the to-be-identified video into a trained two-dimensional neural network model to obtain feature values.

Step 102: inputting the feature values into the trained first three-dimensional neural network model for processing, to obtain the action extraction result of the to-be-identified video.

Step S30 includes:

Step 301: inputting the to-be-identified video into the trained two-dimensional neural network model to obtain feature values.

Step 302: inputting the feature values into the trained second three-dimensional neural network model for processing, to obtain the action category discrimination result of the to-be-identified video.

In this embodiment, the to-be-identified video is input into the two-dimensional neural network model for processing, yielding a more generalized feature expression; the extracted feature values are then input into the first three-dimensional neural network model and the second three-dimensional neural network model respectively for processing.

By first inputting the to-be-identified video into the two-dimensional neural network model and extracting features, the processing efficiency of the three-dimensional neural network models can be improved, thereby improving the efficiency of determining the action category of the to-be-identified video.
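A minimal sketch of this shared-2D-feature pipeline follows; the 2D backbone and both 3D heads are illustrative placeholders, not the architectures of the disclosure:

```python
import torch
import torch.nn as nn

class TwoStagePipeline(nn.Module):
    """Shared 2D feature extractor feeding two 3D networks: one for
    action extraction, one for action category discrimination.
    All layer sizes are illustrative assumptions."""
    def __init__(self, num_classes=3):
        super().__init__()
        self.backbone2d = nn.Sequential(      # applied frame by frame
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
        )
        self.extract3d = nn.Conv3d(64, 1, kernel_size=3, padding=1)
        self.classify3d = nn.Conv3d(64, num_classes, kernel_size=3, padding=1)

    def forward(self, video):                 # (N, C, T, H, W)
        n, c, t, h, w = video.shape
        frames = video.transpose(1, 2).reshape(n * t, c, h, w)
        feats = self.backbone2d(frames)       # shared 2D feature values
        feats = feats.reshape(n, t, 64, feats.shape[-2], feats.shape[-1])
        feats = feats.transpose(1, 2)         # back to (N, C', T, H', W')
        return self.extract3d(feats), self.classify3d(feats)

extraction, category = TwoStagePipeline()(torch.randn(1, 3, 8, 112, 112))
print(extraction.shape, category.shape)
```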
FIG. 8 shows a block diagram of a neural network-based action recognition apparatus according to an embodiment of the present disclosure. As shown in FIG. 8, the neural network-based action recognition apparatus provided in this embodiment includes:

a first three-dimensional identification module 41, configured to input a to-be-identified video into a trained first three-dimensional neural network model for processing, to obtain an action extraction result of the to-be-identified video;

an action extraction result processing module 42, configured to determine an action instance detection result of the to-be-identified video according to the action extraction result of the to-be-identified video;

a second three-dimensional identification module 43, configured to input the to-be-identified video into a trained second three-dimensional neural network model for processing, to obtain an action category discrimination result of the to-be-identified video;

an action category determining module 44, configured to determine an action category of the to-be-identified video according to the action instance detection result and the action category discrimination result of the to-be-identified video.

FIG. 9 shows a block diagram of a neural network-based action recognition apparatus according to an embodiment of the present disclosure. As shown in FIG. 9, building on the apparatus shown in FIG. 8:

The action extraction result of the to-be-identified video includes: a first probability that each frame of image in the to-be-identified video belongs to one action position in an action instance, and the action start frame and the action end frame in the to-be-identified video.

In a possible implementation, the action extraction result processing module 42 includes:

an action detection box calculation sub-module 421, configured to calculate an action detection box in each frame of image according to the first probability of each frame of image in the to-be-identified video, and the action start frame and action end frame in the to-be-identified video;

a matching value calculation sub-module 422, configured to calculate detection box matching values between frames of image according to the action detection boxes;

an action instance determining sub-module 423, configured to determine an action instance detection box of the to-be-identified video according to the detection box matching values.

In a possible implementation, the action category discrimination result of the to-be-identified video includes: an action category probability corresponding to each pixel on each frame of image.

In a possible implementation, the action category determining module 44 includes:

a first action category determining sub-module 441, configured to determine, among the action category probabilities corresponding to the pixels on each frame of image, the action categories corresponding to the pixels in the action instance detection box;

a second action category determining sub-module 442, configured to determine the action category of the to-be-identified video according to the action categories corresponding to the pixels in the action instance detection box.

In a possible implementation, the first three-dimensional identification module 41 includes:

a first two-dimensional identification sub-module 411, configured to input the to-be-identified video into a trained two-dimensional neural network model to obtain feature values;

a first three-dimensional identification sub-module 412, configured to input the feature values into the trained first three-dimensional neural network model for processing, to obtain the action extraction result of the to-be-identified video;

and the second three-dimensional identification module 43 includes:

a second two-dimensional identification sub-module 431, configured to input the to-be-identified video into the trained two-dimensional neural network model to obtain feature values;

a second three-dimensional identification sub-module 432, configured to input the feature values into the trained second three-dimensional neural network model for processing, to obtain the action category discrimination result of the to-be-identified video.
FIG. 10 is a block diagram of a neural network-based action recognition apparatus 1900 according to an exemplary embodiment. For example, the apparatus 1900 can be provided as a server. Referring to FIG. 10, the apparatus 1900 includes a processing component 1922, which further includes one or more processors, and memory resources represented by a memory 1932 for storing instructions executable by the processing component 1922, such as application programs. The application programs stored in the memory 1932 can include one or more modules, each corresponding to a set of instructions. In addition, the processing component 1922 is configured to execute the instructions, to perform the above methods.

The apparatus 1900 can also include a power supply component 1926 configured to perform power management of the apparatus 1900, a wired or wireless network interface 1950 configured to connect the apparatus 1900 to a network, and an input/output (I/O) interface 1958. The apparatus 1900 can operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.

In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as the memory 1932 including computer program instructions, which can be executed by the processing component 1922 of the apparatus 1900 to complete the above methods.
The present disclosure can be a system, a method, and/or a computer program product. The computer program product can include a computer readable storage medium having computer readable program instructions thereon for causing a processor to implement various aspects of the present disclosure.

The computer readable storage medium can be a tangible device that can hold and store instructions used by an instruction execution device. The computer readable storage medium can be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A more specific (non-exhaustive) list of computer readable storage media includes: a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile disc (DVD), memory stick, floppy disk, mechanical encoding device such as a punch card or a raised structure in a groove with instructions stored thereon, and any suitable combination of the foregoing. The computer readable storage medium used here is not to be interpreted as a transient signal itself, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (for example, a light pulse through a fiber optic cable), or an electrical signal transmitted through a wire.

The computer readable program instructions described here can be downloaded from a computer readable storage medium to various computing/processing devices, or downloaded to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The network can include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer readable program instructions from the network and forwards them for storage in a computer readable storage medium in the respective computing/processing device.

Computer program instructions for performing the operations of the present disclosure can be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer readable program instructions can execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field programmable gate array (FPGA), or a programmable logic array (PLA), is personalized by utilizing the state information of the computer readable program instructions, and the electronic circuit can execute the computer readable program instructions to implement various aspects of the present disclosure.

Various aspects of the present disclosure are described here with reference to flowcharts and/or block diagrams of methods, apparatuses (systems), and computer program products according to embodiments of the present disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer readable program instructions.

These computer readable program instructions can be provided to a processor of a general purpose computer, a special purpose computer, or another programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams. These computer readable program instructions can also be stored in a computer readable storage medium; these instructions cause a computer, a programmable data processing apparatus, and/or other devices to operate in a particular manner, such that the computer readable medium storing the instructions includes an article of manufacture that includes instructions implementing various aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.

The computer readable program instructions can also be loaded onto a computer, another programmable data processing apparatus, or another device, so that a series of operational steps are performed on the computer, the other programmable data processing apparatus, or the other device to produce a computer-implemented process, such that the instructions executed on the computer, the other programmable data processing apparatus, or the other device implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.

The flowcharts and block diagrams in the drawings show the possible architectures, functions, and operations of systems, methods, and computer program products according to multiple embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams can represent a module, a program segment, or a portion of instructions, which contains one or more executable instructions for implementing the specified logical functions. In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending upon the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or acts, or by a combination of dedicated hardware and computer instructions.
The embodiments of the present disclosure have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles of the embodiments, their practical application, or technical improvements over the technologies in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (14)

  1. A neural network-based action recognition method, wherein the method comprises:
    inputting a to-be-identified video into a trained first three-dimensional neural network model for processing, to obtain an action extraction result of the to-be-identified video;
    determining an action instance detection result of the to-be-identified video according to the action extraction result of the to-be-identified video;
    inputting the to-be-identified video into a trained second three-dimensional neural network model for processing, to obtain an action category discrimination result of the to-be-identified video;
    determining an action category of the to-be-identified video according to the action instance detection result and the action category discrimination result of the to-be-identified video.
  2. The method according to claim 1, wherein the action extraction result of the to-be-identified video comprises:
    a first probability that each frame of image in the to-be-identified video belongs to one action position in an action instance, and
    an action start frame and an action end frame in the to-be-identified video.
  3. The method according to claim 2, wherein determining the action instance detection result of the to-be-identified video according to the action extraction result of the to-be-identified video comprises:
    calculating an action detection box in each frame of image according to the first probability of each frame of image in the to-be-identified video, and the action start frame and action end frame in the to-be-identified video;
    calculating detection box matching values between frames of image according to the action detection boxes;
    determining an action instance detection box of the to-be-identified video according to the detection box matching values.
  4. The method according to claim 3, wherein the action category discrimination result of the to-be-identified video comprises:
    an action category probability corresponding to each pixel on each frame of image.
  5. The method according to claim 4, wherein determining the action category of the to-be-identified video according to the action instance detection result and the action category discrimination result comprises:
    determining, among the action category probabilities corresponding to the pixels on each frame of image, the action categories corresponding to the pixels in the action instance detection box;
    determining the action category of the to-be-identified video according to the action categories corresponding to the pixels in the action instance detection box.
  6. The method according to claim 1, wherein inputting the to-be-identified video into the trained first three-dimensional neural network model for processing, to obtain the action extraction result of the to-be-identified video, comprises:
    inputting the to-be-identified video into a trained two-dimensional neural network model to obtain feature values;
    inputting the feature values into the trained first three-dimensional neural network model for processing, to obtain the action extraction result of the to-be-identified video;
    and inputting the to-be-identified video into the trained second three-dimensional neural network model for processing, to obtain the action category discrimination result of the to-be-identified video, comprises:
    inputting the to-be-identified video into the trained two-dimensional neural network model to obtain feature values;
    inputting the feature values into the trained second three-dimensional neural network model for processing, to obtain the action category discrimination result of the to-be-identified video.
  7. A neural network-based action recognition apparatus, comprising:
    a first three-dimensional identification module, configured to input a to-be-identified video into a trained first three-dimensional neural network model for processing, to obtain an action extraction result of the to-be-identified video;
    an action extraction result processing module, configured to determine an action instance detection result of the to-be-identified video according to the action extraction result of the to-be-identified video;
    a second three-dimensional identification module, configured to input the to-be-identified video into a trained second three-dimensional neural network model for processing, to obtain an action category discrimination result of the to-be-identified video;
    an action category determining module, configured to determine an action category of the to-be-identified video according to the action instance detection result and the action category discrimination result of the to-be-identified video.
  8. The apparatus according to claim 7, wherein the action extraction result of the to-be-identified video comprises:
    a first probability that each frame of image in the to-be-identified video belongs to one action position in an action instance, and
    an action start frame and an action end frame in the to-be-identified video.
  9. The apparatus according to claim 8, wherein the action extraction result processing module comprises:
    an action detection box calculation sub-module, configured to calculate an action detection box in each frame of image according to the first probability of each frame of image in the to-be-identified video, and the action start frame and action end frame in the to-be-identified video;
    a matching value calculation sub-module, configured to calculate detection box matching values between frames of image according to the action detection boxes;
    an action instance determining sub-module, configured to determine an action instance detection box of the to-be-identified video according to the detection box matching values.
  10. The apparatus according to claim 9, wherein the action category discrimination result of the to-be-identified video comprises:
    an action category probability corresponding to each pixel on each frame of image.
  11. The apparatus according to claim 10, wherein the action category determining module comprises:
    a first action category determining sub-module, configured to determine, among the action category probabilities corresponding to the pixels on each frame of image, the action categories corresponding to the pixels in the action instance detection box;
    a second action category determining sub-module, configured to determine the action category of the to-be-identified video according to the action categories corresponding to the pixels in the action instance detection box.
  12. The apparatus according to claim 7, wherein the first three-dimensional identification module comprises:
    a first two-dimensional identification sub-module, configured to input the to-be-identified video into a trained two-dimensional neural network model to obtain feature values;
    a first three-dimensional identification sub-module, configured to input the feature values into the trained first three-dimensional neural network model for processing, to obtain the action extraction result of the to-be-identified video;
    and the second three-dimensional identification module comprises:
    a second two-dimensional identification sub-module, configured to input the to-be-identified video into the trained two-dimensional neural network model to obtain feature values;
    a second three-dimensional identification sub-module, configured to input the feature values into the trained second three-dimensional neural network model for processing, to obtain the action category discrimination result of the to-be-identified video.
  13. A neural network-based action recognition apparatus, comprising:
    a processor;
    a memory for storing processor-executable instructions;
    wherein the processor is configured to perform the method according to any one of claims 1 to 6.
  14. A non-transitory computer readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method according to any one of claims 1 to 6.
PCT/CN2018/114487 2017-11-09 2018-11-08 Neural network-based recognition method and apparatus WO2019091417A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2020524869A JP6920771B2 (ja) 2017-11-09 2018-11-08 3d畳み込みニューラルネットワークに基づく動作識別方法及び装置

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711097227.2 2017-11-09
CN201711097227.2A 2017-11-09 2017-11-09 Action recognition method and apparatus based on 3D convolutional neural network

Publications (1)

Publication Number Publication Date
WO2019091417A1 (zh)

Family

ID=61272228

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/114487 WO2019091417A1 (zh) 2017-11-09 2018-11-08 基于神经网络的识别方法与装置

Country Status (3)

Country Link
JP (1) JP6920771B2 (zh)
CN (1) CN107766839B (zh)
WO (1) WO2019091417A1 (zh)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110738101A (zh) * 2019-09-04 2020-01-31 Behavior recognition method and apparatus, and computer readable storage medium
CN111291641A (zh) * 2020-01-20 2020-06-16 Image recognition method and apparatus, computer readable medium, and system
CN111797745A (zh) * 2020-06-28 2020-10-20 Training and prediction method, apparatus, device, and medium for an object detection model
CN112587129A (zh) * 2020-12-01 2021-04-02 Human body action recognition method and apparatus
CN112767534A (zh) * 2020-12-31 2021-05-07 Video image processing method and apparatus, electronic device, and storage medium
CN112949359A (zh) * 2019-12-10 2021-06-11 Abnormal behavior recognition method and apparatus based on convolutional neural network
CN113657301A (zh) * 2021-08-20 2021-11-16 Action type recognition method and apparatus based on video stream, and wearable device

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107766839B (zh) 2017-11-09 2020-01-14 清华大学 Action recognition method and apparatus based on 3D convolutional neural network
US20190279100A1 (en) * 2018-03-09 2019-09-12 Lattice Semiconductor Corporation Low latency interrupt alerts for artificial neural network systems and methods
CN108681690B (zh) * 2018-04-04 2021-09-03 浙江大学 Deep-learning-based detection system for standard operations of assembly-line personnel
CN108875601A (zh) * 2018-05-31 2018-11-23 郑州云海信息技术有限公司 Action recognition method, LSTM neural network training method, and related apparatus
JP7268063B2 (ja) 2018-06-29 2023-05-02 バイドゥドットコム タイムズ テクノロジー (ベイジン) カンパニー リミテッド System and method for low-power real-time object detection
CN109086873B (zh) * 2018-08-01 2021-05-04 北京旷视科技有限公司 Training method, recognition method, apparatus, and processing device for recurrent neural networks
CN109344755B (zh) * 2018-09-21 2024-02-13 广州市百果园信息技术有限公司 Video action recognition method, apparatus, device, and storage medium
CN111126115B (zh) * 2018-11-01 2024-06-07 顺丰科技有限公司 Violent sorting behavior recognition method and apparatus
CN111435422B (zh) * 2019-01-11 2024-03-08 商汤集团有限公司 Action recognition method, control method and apparatus, electronic device, and storage medium
CN111488773B (zh) * 2019-01-29 2021-06-11 广州市百果园信息技术有限公司 Action recognition method, apparatus, device, and storage medium
US10902289B2 (en) * 2019-03-22 2021-01-26 Salesforce.Com, Inc. Two-stage online detection of action start in untrimmed videos
CN110427807B (zh) * 2019-06-21 2022-11-15 诸暨思阔信息科技有限公司 Temporal event action detection method
CN110516572B (zh) * 2019-08-16 2022-06-28 咪咕文化科技有限公司 Method for identifying sports event video clips, electronic device, and storage medium
CN111444895B (zh) * 2020-05-08 2024-04-19 商汤集团有限公司 Video processing method and apparatus, electronic device, and storage medium
CN112115788A (zh) * 2020-08-14 2020-12-22 咪咕文化科技有限公司 Video action recognition method and apparatus, electronic device, and storage medium
CN114333065A (zh) * 2021-12-31 2022-04-12 济南博观智能科技有限公司 Behavior recognition method and system applied to surveillance video, and related apparatus

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104966104A (zh) * 2015-06-30 2015-10-07 孙建德 Video classification method based on three-dimensional convolutional neural network
CN105976400A (zh) * 2016-05-10 2016-09-28 北京旷视科技有限公司 Target tracking method and apparatus based on neural network model
US20170112372A1 (en) * 2015-10-23 2017-04-27 International Business Machines Corporation Automatically detecting eye type in retinal fundus images
CN106845411A (zh) * 2017-01-19 2017-06-13 清华大学 Video description generation method based on deep learning and probabilistic graphical models
CN107766839A (zh) * 2017-11-09 2018-03-06 清华大学 Neural network-based action recognition method and apparatus

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106503659B (zh) * 2016-10-24 2019-10-15 天津大学 Action recognition method based on sparse-coding tensor decomposition
CN106557165B (zh) * 2016-11-14 2019-06-21 北京儒博科技有限公司 Action simulation interaction method and apparatus for a smart device, and smart device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104966104A (zh) * 2015-06-30 2015-10-07 孙建德 Video classification method based on three-dimensional convolutional neural network
US20170112372A1 (en) * 2015-10-23 2017-04-27 International Business Machines Corporation Automatically detecting eye type in retinal fundus images
CN105976400A (zh) * 2016-05-10 2016-09-28 北京旷视科技有限公司 Target tracking method and apparatus based on neural network model
CN106845411A (zh) * 2017-01-19 2017-06-13 清华大学 Video description generation method based on deep learning and probabilistic graphical models
CN107766839A (zh) * 2017-11-09 2018-03-06 清华大学 Neural network-based action recognition method and apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JI, SHUIWANG ET AL.: "3D Convolutional Neural Networks for Human Action Recognition", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, vol. 35, no. 1, 31 January 2013 (2013-01-31), pages 221-231, XP011490774, DOI: 10.1109/TPAMI.2012.59 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110738101A (zh) * 2019-09-04 2020-01-31 Behavior recognition method and apparatus, and computer readable storage medium
CN110738101B (zh) * 2019-09-04 2023-07-25 Behavior recognition method and apparatus, and computer readable storage medium
CN112949359A (zh) * 2019-12-10 2021-06-11 Abnormal behavior recognition method and apparatus based on convolutional neural network
CN111291641A (zh) * 2020-01-20 2020-06-16 Image recognition method and apparatus, computer readable medium, and system
CN111291641B (zh) * 2020-01-20 2024-02-27 Image recognition method and apparatus, computer readable medium, and system
CN111797745A (zh) * 2020-06-28 2020-10-20 Training and prediction method, apparatus, device, and medium for an object detection model
CN112587129A (zh) * 2020-12-01 2021-04-02 Human body action recognition method and apparatus
CN112587129B (zh) * 2020-12-01 2024-02-02 Human body action recognition method and apparatus
CN112767534A (zh) * 2020-12-31 2021-05-07 Video image processing method and apparatus, electronic device, and storage medium
CN112767534B (zh) * 2020-12-31 2024-02-09 Video image processing method and apparatus, electronic device, and storage medium
CN113657301A (zh) * 2021-08-20 2021-11-16 Action type recognition method and apparatus based on video stream, and wearable device

Also Published As

Publication number Publication date
CN107766839B (zh) 2020-01-14
CN107766839A (zh) 2018-03-06
JP6920771B2 (ja) 2021-08-18
JP2021502638A (ja) 2021-01-28

Similar Documents

Publication Publication Date Title
WO2019091417A1 (zh) 2019-05-16 Neural network-based recognition method and apparatus
US20230215160A1 (en) Action recognition using implicit pose representations
US9892326B2 (en) Object detection in crowded scenes using context-driven label propagation
CN110622176A (zh) 2019-12-27 Video partitioning
US11468680B2 (en) 2022-10-11 Shuffle, attend, and adapt: video domain adaptation by clip order prediction and clip attention alignment
CN108009466B (zh) Pedestrian detection method and apparatus
JP2019003299A (ja) Image recognition apparatus and image recognition method
CN113642431A (zh) Training method and apparatus for a target detection model, electronic device, and storage medium
US20220222977A1 (en) Person verification device and method and non-transitory computer readable media
US11756205B2 (en) Methods, devices, apparatuses and storage media of detecting correlated objects involved in images
CN113378770A (zh) Gesture recognition method, apparatus, device, storage medium, and program product
CN109558790B (zh) Pedestrian target detection method, apparatus, and system
CN108875506B (zh) Face shape point tracking method, apparatus, and system, and storage medium
CN113869205A (zh) Object detection method and apparatus, electronic device, and storage medium
US20220300774A1 (en) Methods, apparatuses, devices and storage media for detecting correlated objects involved in image
CN114120454A (zh) Training method and apparatus for a liveness detection model, electronic device, and storage medium
CN116740607A (zh) Video processing method and apparatus, electronic device, and storage medium
WO2023185074A1 (zh) Group behavior recognition method based on complementary spatio-temporal information modeling
WO2023077897A1 (zh) Human body detection method and apparatus, electronic device, and computer readable storage medium
CN114419564B (zh) Vehicle pose detection method, apparatus, device, medium, and autonomous driving vehicle
CN114220163B (zh) Human body pose estimation method, apparatus, electronic device, and storage medium
US20220237884A1 (en) Keypoint based action localization
Liu et al. Building semantic maps for blind people to navigate at home
KR20150042674A (ko) Multimodal user recognition robust to environment variation
CN108694347B (zh) Image processing method and apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18877249

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020524869

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18877249

Country of ref document: EP

Kind code of ref document: A1