CN113688804B - Multi-angle video-based action identification method and related equipment - Google Patents

Multi-angle video-based action identification method and related equipment

Info

Publication number
CN113688804B
CN113688804B (application CN202111241878.0A)
Authority
CN
China
Prior art keywords
action
reference image
image frame
target object
description
Prior art date
Legal status
Active
Application number
CN202111241878.0A
Other languages
Chinese (zh)
Other versions
CN113688804A (en)
Inventor
丁强刚
黄予
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202111241878.0A
Publication of CN113688804A
Application granted
Publication of CN113688804B
Legal status: Active

Abstract

The embodiment of the invention discloses a multi-angle video-based action identification method and related equipment, which can be applied to scenes such as cloud technology, cloud security, artificial intelligence and intelligent traffic. The method comprises: acquiring N detection videos of a target object captured from multiple angles, and performing image frame extraction processing and weighted fusion processing on each detection video to obtain an action description graph that records the action information performed by the target object in the detection video at the corresponding shooting angle.

Description

Multi-angle video-based action identification method and related equipment
Technical Field
The application relates to the technical field of computers, in particular to a multi-angle video-based action identification method and related equipment.
Background
With the continuous and deepening development of internet technology, computer technology can be used to assist users in production and life; for example, computer technology can assist users in action determination. However, existing computer-based action determination mainly analyzes data collected by sensors in devices worn by the user. Because such sensors have low precision and poor resistance to environmental interference, action determination based on the analysis of sensor data has low accuracy.
Disclosure of Invention
The embodiment of the application provides a multi-angle video-based action recognition method and related equipment, which can improve the accuracy of recognizing a target action performed by an object in target video data.
In one aspect, an embodiment of the present invention provides a method for motion recognition based on a multi-angle video, including:
acquiring N detection videos obtained by shooting a target object from a plurality of different shooting angles, wherein N is a positive integer greater than or equal to 2;
performing image frame extraction processing and weighted fusion processing on the N detection videos to obtain M action description graphs, wherein M is a positive integer, and one action description graph is used for recording the action information of the action executed by the target object in a detection video at the corresponding shooting angle;
performing information fusion processing on the action information recorded by each action description graph to obtain an action fusion description graph used for describing the action information of the target object at the plurality of different shooting angles;
and identifying the action executed by the target object according to the action fusion description graph to obtain an action identification result for the target object, wherein the action identification result is used for representing whether the target object executes the target action.
In another aspect, an embodiment of the present invention provides a motion recognition apparatus based on a multi-angle video, including:
an acquisition unit, configured to acquire N detection videos obtained by shooting a target object from a plurality of different shooting angles, wherein N is a positive integer greater than or equal to 2;
a processing unit, configured to perform image frame extraction processing and weighted fusion processing on the N detection videos to obtain M action description graphs, wherein M is a positive integer, and one action description graph is used for recording the action information of the action executed by the target object in a detection video at the corresponding shooting angle;
the processing unit is further configured to perform information fusion processing on the action information recorded in each action description graph to obtain an action fusion description graph used for describing the action information of the target object at the plurality of different shooting angles;
and an identification unit, configured to identify the action executed by the target object according to the action fusion description graph to obtain an action identification result for the target object, wherein the action identification result is used for representing whether the target object executes the target action.
In still another aspect, an embodiment of the present invention provides a computer device, including a processor, an input device, an output device, and a memory, which are connected to each other, wherein the memory is configured to store a computer program that supports the computer device in executing the above method, the computer program includes program instructions, and the processor is configured to call the program instructions to perform the following steps:
acquiring N detection videos obtained by shooting a target object from a plurality of different shooting angles, wherein N is a positive integer greater than or equal to 2;
performing image frame extraction processing and weighted fusion processing on the N detection videos to obtain M action description graphs, wherein M is a positive integer, and one action description graph is used for recording the action information of the action executed by the target object in a detection video at the corresponding shooting angle;
performing information fusion processing on the action information recorded by each action description graph to obtain an action fusion description graph used for describing the action information of the target object at the plurality of different shooting angles;
and identifying the action executed by the target object according to the action fusion description graph to obtain an action identification result for the target object, wherein the action identification result is used for representing whether the target object executes the target action.
In still another aspect, an embodiment of the present invention provides a computer-readable storage medium, in which program instructions are stored, and when the program instructions are executed by a processor, the program instructions are used for executing the method for motion recognition based on multi-angle video according to the first aspect.
In the embodiment of the application, in the process of recognizing the action of a target object to determine whether the action performed by the target object is the target action, the computer device may first acquire a plurality of detection videos obtained by shooting the target object from different angles. The computer device may then perform image frame extraction processing and weighted fusion processing on each of the acquired detection videos to obtain a plurality of action description graphs, each of which records the relevant information of the action performed by the target object at the corresponding shooting angle. After obtaining the action description graph of each detection video, the computer device may perform information fusion processing on the obtained action description graphs to obtain the action fusion description graph of the target object, so that the computer device acquires the action information of the target object at different angles. By performing recognition processing on the action fusion description graph, the computer device can take into account the action information of the target object at the different shooting angles and thus recognize the actions of the target object from multiple angles, which improves the accuracy and confidence of the action determination performed on the target object.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present invention, and that those skilled in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1 is a schematic diagram of an object-specific motion recognition system according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a multi-angle video-based motion recognition method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of one action provided by an embodiment of the present invention;
FIG. 4 is a schematic flow chart of another method for motion recognition based on multi-angle video according to an embodiment of the present invention;
FIG. 5a is a schematic diagram of a first algorithm according to an embodiment of the present invention;
FIG. 5b is a diagram illustrating a second algorithm according to an embodiment of the present invention;
FIG. 5c is a diagram illustrating a multi-angle video-based motion recognition method according to an embodiment of the present invention;
FIG. 5d is a diagram illustrating comparison results of motion detection by different methods according to an embodiment of the present invention;
FIG. 5e is a diagram of an output result of target motion detection according to an embodiment of the present invention;
fig. 6 is a schematic block diagram of a motion recognition apparatus according to an embodiment of the present invention;
fig. 7 is a schematic block diagram of a computer device provided by an embodiment of the present invention.
Detailed Description
The embodiment of the application provides a multi-angle video-based action recognition method. When detecting whether a target object performs a target action, the computer device first acquires videos of the target object shot from different angles, thereby obtaining a plurality of detection videos. The computer device then performs information fusion processing on the action information recorded in the action description graph corresponding to each detection video to obtain an action fusion description graph that describes the action information of the target object at a plurality of different angles, and determines whether the target object performs the target action according to the recognition result of the action information recorded in the action fusion description graph. In this way, the computer device performs target action detection based on multi-view shooting information of the target object, which improves the accuracy and credibility of the determination of the target action. The target object may be any object of a certain type, such as any person and/or any animal; alternatively, the target object may be a specific type of object, such as an elderly person over the age of 60 (or 70, etc.) and/or a reptile. In the embodiment of the present application, the target object is mainly described as a human object. Further, the target action may be a fall action, a running action, an illegal action, or the like. Based on the detection of the target action, the current state of the target object can be determined, for example whether the target object has fallen (according to the fall action), whether the target object is performing physical exercise (according to the running action), or whether the target object is performing an irregular action (according to the illegal action). Based on this determination of the object state, the computer device can push subsequent related services to the target object according to that state, for example sending a reminder to a related object of the target object when the target object is determined to have fallen, sending a reminder to exercise on schedule when the target object is determined not to have performed physical exercise, or pushing a corresponding related service when the target object is determined to be performing an illegal action. In the embodiment of the present application, the target action is mainly set as a fall action for detailed explanation; when the target action is another action, reference may be made to this embodiment.
In an embodiment, when the computer device performs target action detection on a target object, the detection videos of the target object at different angles may be obtained by a single image capture device shooting the target object from different angles, or by a plurality of different image capture devices shooting the target object from different angles. In other words, each of the plurality of detection videos collected by the computer device contains the target object, and the computer device can further process the detection videos to determine whether the target object contained in them performs the target action. In an embodiment, after obtaining the plurality of detection videos, the computer device may convert each detection video into a corresponding action description graph to obtain a plurality of action description graphs, fuse the obtained action description graphs, and perform recognition processing on the resulting action fusion description graph to determine whether the target object performs the target action. Specifically, as shown in fig. 1, if the number of detection videos acquired by the computer device is N, the computer device may acquire the action description graphs corresponding to detection videos 1 to N, and then determine whether the target object performs the target action from the recognition result of the action fusion description graph obtained by fusing these action description graphs. One action description graph is obtained from one detection video, and one detection video may yield one or more action description graphs; for example, one detection video may be input into different branch networks of a multi-branch convolutional neural network to obtain a plurality of corresponding, different action description graphs. The multi-branch convolutional neural network is a network model generated through Machine Learning (ML). Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other subjects; it is the core of Artificial Intelligence (AI), the fundamental approach to making computer devices intelligent, and is applied in all fields of artificial intelligence.
In one embodiment, the process in which the computer device obtains the action description graph of each detection video and fuses the obtained action description graphs can be realized by calling a multi-branch convolutional neural network, where the multi-branch convolutional neural network includes a plurality of branch networks and a backbone network. The computer device may input one detection video into one branch network, so that the corresponding branch network processes the input detection video and outputs the action description graph of that detection video. After each branch network obtains the action description graph of its detection video, the action description graphs obtained by the branch networks can be input into the backbone network of the multi-branch convolutional neural network, and the backbone network performs information fusion processing on the action description graphs obtained by the branch networks to obtain the action fusion description graph. In an embodiment, the multi-branch convolutional neural network may be implemented by a deep convolutional neural network; for example, the first residual convolution module of the deep convolutional neural network may be used as the processing module of each branch network, and the subsequent residual convolution modules may be used as the processing modules of the backbone network. That is, when the computer device calls a branch network to process the corresponding detection video and obtain its action description graph, it calls the first residual convolution module of the deep convolutional neural network to process that detection video; and when it calls the backbone network to fuse the obtained action description graphs, it calls the subsequent residual convolution modules of the deep convolutional neural network to process the action description graphs.
The deep convolutional neural network may be ResNet-18, VGGNet, InceptionNet, ResNet, or another deep convolutional network. An advantage of this design is that the computer device calls the same convolution module when calling different branch networks to process different detection videos, which ensures the algorithmic consistency of the computer device when extracting action description graphs from the detection videos and avoids the problem that subsequent fusion cannot be performed because inconsistent algorithms were used when extracting the action description graphs of the detection videos.
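To make the branch/backbone split described above concrete, here is a minimal PyTorch-style sketch built on torchvision's ResNet-18; it is an illustration only — the split point, the element-wise max fusion, and the two-class head are assumptions, not the patent's exact network.

```python
# Hypothetical sketch of the branch/backbone split: the stem plus first
# residual stage of a ResNet-18 serves as each per-view branch, and the
# remaining stages act as the shared backbone on the fused map.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class MultiBranchActionNet(nn.Module):
    def __init__(self, num_views: int, num_classes: int = 2):
        super().__init__()
        # One "branch" per shooting angle.
        self.branches = nn.ModuleList()
        for _ in range(num_views):
            r = resnet18(weights=None)
            self.branches.append(
                nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool, r.layer1)
            )
        # Shared "backbone": the later residual stages process the fused map.
        r = resnet18(weights=None)
        self.backbone = nn.Sequential(r.layer2, r.layer3, r.layer4, r.avgpool)
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, views):
        # views: list of (B, 3, H, W) tensors, one per shooting angle
        maps = [branch(v) for branch, v in zip(self.branches, views)]
        fused = torch.stack(maps, dim=0).max(dim=0).values  # element-wise max fusion
        feat = torch.flatten(self.backbone(fused), 1)
        return self.classifier(feat)  # e.g. fall / no-fall logits
```

Reusing the same residual-stage design for every branch mirrors the point made above about keeping the per-view extraction algorithm consistent so that the resulting maps can be fused.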
Please refer to fig. 2, which is a schematic flowchart illustrating a method for motion recognition based on multi-angle video according to an embodiment of the present application, and as shown in fig. 2, the method may include:
S201, acquiring N detection videos obtained by shooting a target object from a plurality of different shooting angles, wherein N is a positive integer greater than or equal to 2.
The N detection videos acquired by the computer device include videos obtained by shooting the target object from different angles with image capture devices. After the image capture devices collect the shot videos of the target object from different angles, the computer device may directly use the shot videos as detection videos to detect in real time whether the target object performs the target action; alternatively, the computer device may first cache the shot videos as detection videos to be processed, and, when it is subsequently determined that target action detection is needed, obtain the detection videos from the cache and execute the subsequent detection process. In one embodiment, the number of detection videos acquired by the computer device from the image capture devices is N, where N is a positive integer greater than or equal to 2; that is, the computer device performs target action detection on the target object based on detection videos of at least two shooting angles so as to improve detection accuracy, since target action detection based on detection videos shot at different angles is more robust than detection based on a detection video of a single shooting angle.
In an embodiment, the N detection videos may be collected by the same image capture device at different shooting angles, or collected by different image capture devices. In this embodiment, the detection videos acquired by the computer device are mainly obtained by different image capture devices shooting the target object from different angles during the same time period, so that the computer device can analyze whether the target object performs the target action within that time period. When the detection videos are shot by the same image capture device at different shooting angles, the same device cannot shoot from different angles at the same time; in that case, if the detection videos obtained by the computer device are shot videos from different angles, the computer device can detect whether the target object performs the target action within the corresponding shooting time periods based on the obtained shot videos. The image capture device may be a camera or the like, and the computer device may be a server or a terminal device, where the server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services. The terminal device may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, a vehicle-mounted terminal, a smart television, or a smart voice interaction device. After acquiring the N detection videos, the computer device may further obtain the action description graph of each detection video so as to determine whether the target object performs the target action based on the obtained action description graphs; that is, the computer device may proceed to step S202.
S202, performing image frame extraction processing and weighted fusion processing on the N detection videos to obtain M action description graphs, wherein M is a positive integer, and one action description graph is used for recording the action information of the action executed by the target object in a detection video at the corresponding shooting angle.
The action description graph is an image used for recording the action information performed by the target object in a detection video at a given shooting angle. When the action description graph records the action information, the action performed at the corresponding shooting angle is highlighted at the corresponding execution position, while positions where no action is performed are not highlighted. The action description graph can therefore effectively reflect the relevant information of the action performed by the target object in the detection video at the corresponding shooting angle, so that the computer device can determine, through recognition processing of the action description graph, whether the target object performs an action and the specific type of the action performed. In one embodiment, the action of an object (such as the above-mentioned target object) refers to the position change of the limbs of the object relative to the environment within a certain time range. An action such as walking is a change in the position of the object's feet relative to the ground within a particular time range, whereas a fall is a change in the position of the object's body relative to the ground within a short time range. The action of the object is therefore related to the time sequence. As shown in fig. 3, if the object is in the posture marked 301 in fig. 3 at time t1 and in the posture marked 302 at time t2, the action performed by the object is raising a hand; if the object is in the posture marked 302 at time t1 and in the posture marked 301 at time t2, the action performed is lowering a hand. Based on the relationship between action and time sequence, the computer device determines the display time corresponding to each image frame in a detection video when converting the detection video into the corresponding action description graph through image frame extraction and weighted fusion processing. If the number of detection videos acquired by the computer device is N (N being a positive integer greater than or equal to 2), the computer device may obtain M action description graphs after processing the N detection videos through image frame extraction and weighted fusion, where M is a positive integer greater than or equal to N; that is, one or more action description graphs may be obtained from one detection video, but each action description graph corresponds to one detection video. In the embodiment of the present application, the detailed description mainly takes the case in which the computer device performs image frame extraction processing and weighted fusion processing on one detection video to obtain one action description graph.
In one embodiment, when converting any detection video into the corresponding action description graph, the computer device may first perform image frame extraction processing on the detection video to obtain one or more reference image frames, where each reference image frame extracted from the detection video contains the target object, and the extracted reference image frames may be some or all of the image frames in the detection video. After extracting the one or more reference image frames, the computer device may determine the time importance of the display time of each reference image frame based on the display time of each extracted reference image frame. The time importance of the display time corresponding to a reference image frame indicates how important the corresponding display time is for describing the action: the greater the time importance of a display time, the greater the role of the corresponding reference image frame in action determination, and thus the greater the importance score of that reference image frame. Therefore, after extracting one or more reference image frames from the detection video, the computer device can perform weighted fusion processing on the reference image frames based on their importance scores to obtain the action description graph of the detection video.
When determining the time importance of the display time corresponding to any one of the one or more reference image frames, the computer device may obtain the display time corresponding to that reference image frame, denoted t, and the display duration of the one or more reference image frames, denoted T. The computer device may then perform a harmonic operation on the display time of the reference image frame based on the display time t and the display duration T of the image frame sequence, thereby obtaining the time importance of the display time of that reference image frame. After obtaining the time importance, the computer device may use it directly as the importance score of the corresponding reference image frame, or may preprocess the time importance and use the processed value as the importance score. After obtaining the importance score of each reference image frame in the detection video, the computer device can perform weighted summation (i.e., weighted fusion) on the reference image frames based on the importance scores, thereby obtaining the action description graph corresponding to the detection video. One detection video obtained by the computer device can be converted into one corresponding action description graph, so the computer device can convert the obtained N detection videos into N corresponding action description graphs.
And S203, performing information fusion processing on the motion information recorded by each motion description diagram to obtain motion fusion description diagrams for describing the motion information of the target object at a plurality of different shooting angles.
And S204, identifying the action executed by the target object according to the action fusion description graph to obtain an action identification result aiming at the target object, wherein the action identification result is used for representing whether the target object executes the target action.
In steps S203 and S204, after the computer device obtains the one or more action description graphs corresponding to the N detection videos, if the N detection videos were obtained by different image capture devices shooting the target object from different angles during the same time period, the computer device obtains shot videos of the target object from different angles for the same time period; that is, the multiple detection videos obtained by the computer device are detection videos of the object from different angles. Correspondingly, because the action description graph converted from each detection video records the action information of the target object at a different angle, after obtaining the action description graph corresponding to each detection video the computer device can perform information fusion processing on the obtained action description graphs, thereby obtaining the action fusion description graph of the target object, which indicates the actions of the target object at a plurality of different viewing angles. In one embodiment, when performing information fusion processing on the obtained action description graphs to obtain the corresponding action fusion description graph, the computer device may fuse and merge the action information in the obtained action description graphs through a pooling operation. When performing information fusion processing based on the pooling operation, the computer device may first extract, from each action description graph, the effective information used for describing the action of the target object (i.e., the action information of the target object) and perform information fusion processing on the extracted information to obtain the action fusion description graph of the target object, where the information extracted by the computer device includes the pixel regions of the action description graph in which the action information is recorded.
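Purely as an illustration of the pooling-based fusion (the activity threshold and the choice of element-wise max pooling are assumptions, not the patent's operator), the fusion of N per-view description maps might be sketched as:

```python
# Hedged NumPy sketch of fusing N action description maps with a pooling
# operation: low-energy pixels are zeroed first, so only regions that
# actually record motion contribute to the fused map.
import numpy as np


def fuse_description_maps(maps: np.ndarray, activity_quantile: float = 0.5) -> np.ndarray:
    """maps: (N, H, W) array, one action description map per shooting angle."""
    active_views = []
    for m in maps:
        thr = np.quantile(np.abs(m), activity_quantile)
        active_views.append(np.where(np.abs(m) >= thr, m, 0.0))  # keep motion-carrying pixels
    # Element-wise max pooling across views yields one action fusion description map.
    return np.max(np.stack(active_views, axis=0), axis=0)
```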
In an embodiment, after performing information fusion processing on the obtained action description graphs to obtain the action fusion description graph of the target object, the computer device may perform feature recognition on the action fusion description graph to obtain the action recognition result for the target object. It can be understood that the process of recognizing the features of the action fusion description graph is a process of recognizing the action information of the target object at a plurality of different angles. When discriminating the action of the target object, the computer device can therefore combine the descriptions of the actions performed by the target object at different angles and discriminate the performed action using the action information of multiple angles. Because the computer device acquires the action information of the target object at different angles, it can obtain discrimination information of the target object under different receptive fields during action detection, thereby improving the accuracy of the action detection performed on the target object.
In the embodiment of the application, in the process of recognizing the action of a target object to determine whether the action performed by the target object is the target action, the computer device may first acquire a plurality of detection videos obtained by shooting the target object from different angles. The computer device may then perform image frame extraction processing and weighted fusion processing on each of the acquired detection videos to obtain a plurality of action description graphs, each of which records the relevant information of the action performed by the target object at the corresponding shooting angle. After obtaining the action description graph of each detection video, the computer device may perform information fusion processing on the obtained action description graphs to obtain the action fusion description graph of the target object, so that the computer device acquires the action information of the target object at different angles. By performing recognition processing on the action fusion description graph, the computer device can take into account the action information of the target object at the different shooting angles and thus recognize the actions of the target object from multiple angles, which improves the accuracy and confidence of the action determination performed on the target object.
Please refer to fig. 4, which is a schematic flowchart illustrating another method for motion recognition based on multi-angle video according to an embodiment of the present application, and as shown in fig. 4, the method may include:
S401, acquiring N detection videos obtained by shooting a target object from a plurality of different shooting angles, wherein N is a positive integer greater than or equal to 2.
In one embodiment, the detection video is obtained by an image capture device shooting the target object, the target object being any object in the environment shot by the image capture device; the image capture device obtains the shot video of the shooting environment by performing image acquisition processing on that environment. After the image capture device obtains the shot video of the shooting environment, image frame extraction processing and object recognition processing can be performed on the shot video, so that the video frames containing the target object can be determined from among the objects contained in the shot video, and the video frames containing the target object can be used as the detection video of the target object. In one implementation, after obtaining the shot video of the shooting environment, the image capture device may directly perform image frame extraction processing and object recognition processing on it to determine the detection video corresponding to the target object, and the computer device may then obtain a plurality of detection videos of the target object from different image capture devices and perform action determination on the target object based on them. In another implementation, after obtaining the shot video of the shooting environment, the image capture device may directly send the shot video to the computer device, so that the computer device performs the image frame extraction and object recognition on the shot video and obtains the detection video of the target object.
In one embodiment, the target object may be an object preset in the computer device (or the image capture device); for example, when fall detection is performed for elderly people, the target object is an elderly person (i.e., a person whose age is greater than an age threshold). Alternatively, the target object may be set by a user, who may input an image of the target object into the computer device (or image capture device), so that the computer device (or image capture device) can determine the target object from the input image and then obtain the detection video of the target object. In an embodiment, in the process of acquiring the detection video of the target object, if the shot video obtained by the image capture device contains a plurality of different objects, the computer device may perform object recognition processing and video frame extraction processing on that shot video so as to obtain a detection video containing only the target object. In another implementation, if the detection video obtained from the image capture device contains a plurality of different objects, the computer device may instead mark the target object contained in the detection video so as to distinguish it from the other objects, and use the detection video marked for the target object as the detection video of the target object.
In one embodiment, the computer device acquires at least two detection videos of the target object, obtained from different shooting angles. The detection videos from different shooting angles may be obtained by the computer device from the same image capture device or from different image capture devices. When a plurality of detection videos of the target object are acquired from different image capture devices, each image capture device shoots the shooting environment at a fixed angle, and a switching operation between two different image capture devices is executed based on the movement of the target object in the shooting environment, so that the plurality of image capture devices obtain a plurality of detection videos of the target object. For example, suppose that, based on the current position of the target object in the shooting environment and the capture range of each of the image capture devices deployed for that environment, the device currently acquiring the shot video of the target object is image capture device 1. As the target object moves in the shooting environment, if it moves into the capture range of image capture device 2, image capture device 1 may send a capture instruction to image capture device 2; the shot video is then acquired by image capture device 2, and image capture device 1 stops acquiring the shot video. Alternatively, when the target object moves into the capture range of image capture device 2, the control device (such as the above-mentioned computer device) corresponding to image capture devices 1 and 2 controls image capture device 2 to shoot the target object, so as to obtain a plurality of detection videos of the target object. It should be noted that the embodiment of the present application does not limit the manner of obtaining the plurality of detection videos of the target object.
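The device hand-off described in this paragraph can be pictured with the following toy sketch; the rectangular capture ranges and all identifiers are hypothetical and only illustrate the switching logic, not an actual camera API.

```python
# Toy sketch of switching between image capture devices as the target moves.
# Capture ranges are modelled as axis-aligned rectangles; everything here is
# an illustrative assumption.
from dataclasses import dataclass


@dataclass
class Camera:
    cam_id: int
    x_min: float
    x_max: float
    y_min: float
    y_max: float

    def covers(self, x: float, y: float) -> bool:
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


def select_active_camera(cameras, target_xy, current_id):
    """Keep the current camera while it still covers the target; otherwise
    switch to the first camera whose capture range contains the target."""
    x, y = target_xy
    for cam in cameras:
        if cam.cam_id == current_id and cam.covers(x, y):
            return current_id
    for cam in cameras:
        if cam.covers(x, y):
            return cam.cam_id  # issue the switch to this device
    return current_id  # no camera covers the target; keep the current one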
S402, performing image frame extraction processing on any one of the N detection videos to obtain one or more reference image frames of any one of the N detection videos.
And S403, acquiring an importance score corresponding to any reference image frame, wherein the importance score is used for representing the importance of the corresponding reference image frame.
S404, weighting and summing the corresponding reference image frames by adopting the importance scores to obtain action description graphs corresponding to any detection video, thereby obtaining M action description graphs, wherein one action description graph is used for recording action information of a target object in the detection video executed at a corresponding shooting angle.
In steps S402 to S404, after obtaining a plurality of detection videos of the target object, the computer device may further obtain the one or more action description graphs corresponding to each detection video. In a specific implementation, the computer device obtains the action description graph corresponding to each detection video by performing image frame extraction processing and weighted fusion processing on that detection video. When performing image frame extraction processing on any detection video to obtain its one or more reference image frames, the computer device may first perform image frame extraction processing on the detection video based on the display order of its image frames to obtain the characterization sequence corresponding to the detection video. The characterization sequence is an image frame sequence extracted from the detection video that can represent the action performed by the target object, i.e., it is an action description sequence of the target object in the corresponding detection video. After obtaining the characterization sequence, the computer device may take the image frames contained in it as the extracted one or more reference image frames. The image frames of the characterization sequence determined from the detection video by image frame extraction processing may be consecutive; for example, the extracted characterization sequence may consist of frames 2, 3 and 4 of the detection video. They may also be non-consecutive; for example, the extracted characterization sequence may consist of frames 1, 3 and 4 of the detection video. That is, the image frames in the characterization sequence extracted from the detection video may be consecutive or non-consecutive, but every image frame in the characterization sequence must contain the target object. In one embodiment, the computer device may perform image frame extraction processing on the detection video based on prior empirical values and extract a certain part of the image frames as the characterization sequence. In addition, when performing image frame extraction processing on the detection video to obtain the one or more reference image frames, the computer device may also obtain a sliding window and perform image frame extraction processing on the detection video using that sliding window, in which case the image frames inside the sliding window are used as the reference image frames.
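As a rough sketch of the sliding-window variant of the frame extraction just described (the detector callback, window size and stride are illustrative assumptions, not the patent's implementation), the characterization sequences could be gathered like this:

```python
# Illustrative sketch of extracting reference image frames with a sliding
# window; `contains_target` stands in for any object-recognition step and is
# a hypothetical callable.
from typing import Callable, List, Sequence


def extract_reference_frames(
    frames: Sequence,                           # decoded frames of one detection video, in display order
    contains_target: Callable[[object], bool],  # returns True if the frame shows the target object
    window: int = 16,
    stride: int = 8,
) -> List[list]:
    """Return one characterization sequence per window position, keeping only
    frames that contain the target object (frames may end up non-consecutive)."""
    sequences = []
    for start in range(0, max(len(frames) - window + 1, 1), stride):
        chunk = [f for f in frames[start:start + window] if contains_target(f)]
        if chunk:
            sequences.append(chunk)
    return sequences
```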
In one embodiment, since the calculation of the action description graph by the computer device takes the form of a linear weighted summation of the original image frames, after extracting the one or more reference image frames the computer device may obtain the action description graph of the detection video by performing weighted summation on the extracted reference image frames. To do so, the computer device first needs to obtain the importance score of each reference image frame, so as to perform weighted summation on the corresponding reference image frames using those importance scores. When obtaining the importance score of any reference image frame, the computer device may first obtain a time sequence used for representing the display time of each reference image frame in the corresponding detection video, where each display time in the time sequence corresponds to one reference image frame; that is, the time sequence obtained by the computer device is formed by the display times of the one or more reference image frames in the detection video. After obtaining the time sequence, the computer device may perform harmonic processing on the display time of any reference image frame based on the time sequence to obtain the time importance of that display time, and this time importance can be used as the importance score of the reference image frame corresponding to that display time. The time importance represents how important the corresponding display time is for describing the action: the greater the time importance of a display time, the more important that display time is for describing the action, i.e., the more important the reference image frame associated with that display time is for action determination.
In one implementation, if the computer device determines from the time sequence that the display time of a reference image frame is t, the time importance of that display time may be denoted α_t. When the computer device performs harmonic processing on the display time and obtains the time importance of that display time, the time importance can be computed with formula 1:

α_t = 2(T − t + 1) − (T + 1)(H_T − H_{t−1})  (formula 1)

where H_t is the t-th harmonic number, obtained by harmonic processing based on the display time t; it can be calculated by formula 2, with H_0 = 0. In addition, T denotes the display duration of the extracted one or more reference image frames, i.e., the total duration corresponding to the time sequence.

H_t = Σ_{i=1}^{t} 1/i  (formula 2)
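For illustration, the following Python sketch evaluates the importance score defined by formulas 1 and 2 as reconstructed above; the closed form follows the usual approximate rank-pooling coefficient, so treat the exact expression as an assumption rather than a verbatim transcription of the patent's figures.

```python
# Sketch of the time-importance score of formulas 1-2 as written above. The
# closed form is reconstructed from the approximate rank-pooling literature
# and is an assumption, not copied from the patent's own figures.
from functools import lru_cache


@lru_cache(maxsize=None)
def harmonic(t: int) -> float:
    """t-th harmonic number H_t, with H_0 = 0 (formula 2)."""
    return sum(1.0 / i for i in range(1, t + 1))


def importance_score(t: int, T: int) -> float:
    """Importance score alpha_t of the reference frame shown at time t
    in a sequence whose display duration is T (formula 1)."""
    return 2.0 * (T - t + 1) - (T + 1) * (harmonic(T) - harmonic(t - 1))
```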
After the computer device obtains the importance score α_t of any reference image frame, when using the importance scores to perform weighted summation on the corresponding reference image frames and obtain the action description graph of the detection video, the computer device may first determine the algorithm used for calculating the action description graph. In one implementation, if the computer device uses the first algorithm to calculate the action description graph, it may obtain the feature vector of each of the one or more reference image frames, perform weighted summation on the corresponding feature vectors using the importance score of each reference image frame to obtain the feature vector of the detection video, and then perform reduction and reconstruction processing on that feature vector to obtain the action description graph of the detection video. The calculation formula used by the first algorithm to compute the action description graph is constructed based on the feature vectors of the reference image frames, and the first expression of the first algorithm is formula 3:

d* = Σ_{t=1}^{T} α_t · ψ(I_t)  (formula 3)

where d* denotes the action description graph, I_t denotes any one of the one or more reference image frames obtained by the computer device from the detection video, and ψ(I_t) denotes the feature vector extracted from the reference image frame I_t. As can be seen from formula 3, the action description graph of a detection video is equivalent to the feature vector obtained by performing weighted summation on the feature vectors of the reference image frames according to the importance scores of the one or more reference image frames extracted from that video; that is, based on the importance score of each reference image frame, the feature vector of the corresponding reference image frame can be weighted and the weighted vectors summed to obtain the feature vector corresponding to the detection video.
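As a toy illustration of the first algorithm only (the feature extractor, the score list and the reshape step are placeholders, not the patent's implementation), the weighted summation of formula 3 might look like this:

```python
# Hedged sketch of the first algorithm (formula 3): weight each frame's
# feature vector psi(I_t) by its importance score alpha_t, sum the weighted
# vectors, and rebuild an image-like action description map from the result.
import numpy as np


def description_map_first_algorithm(frames, scores, feature_fn, map_shape):
    """frames: reference image frames I_1..I_T in display order;
    scores: importance score alpha_t for each frame;
    feature_fn: psi, mapping a frame to a 1-D feature vector;
    map_shape: (H, W, C) used for the "reduction and reconstruction" step."""
    d = sum(alpha * feature_fn(frame) for alpha, frame in zip(scores, frames))
    return np.asarray(d).reshape(map_shape)
```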
In one embodiment, after acquiring the one or more reference image frames and the corresponding importance scores, the computer device may invoke code such as that shown in fig. 5a when the first algorithm is used to determine the action description graph. In another implementation, if the computer device uses the second algorithm to calculate the action description graph, it may perform an approximate conversion on the importance scores to obtain the approximate scores of the importance scores, and then perform weighted summation on the corresponding reference image frames using the approximate score of each reference image frame to obtain the action description graph of the detection video.
In a specific implementation, based on the expression for the importance score in formula 1, the computer device may perform an approximate conversion of the importance score with respect to the value of t to obtain the corresponding approximate score, which may be denoted α̂_t and is given by formula 4:

α̂_t = 2t − T − 1  (formula 4)
where α̂_t denotes the approximate score obtained from the importance score. Then, based on the translation invariance of formula 4, the computer device can dynamically update the action description graph, which yields the second algorithm for fast computation of the action description graph; the second expression corresponding to the second algorithm may be as shown in formula 5:

d* = Σ_{t=1}^{L} (2t − L − 1) · I_t  (formula 5)
where L denotes the number of extracted reference image frames or the window size of the sliding window; when the window size of the sliding window is L, the determined reference image frames can be represented as ⟨I_1, I_2, …, I_L⟩. Similarly, if the computer device decides to use the second algorithm to calculate the action description graph, this may be implemented by the code shown in fig. 5b.
Practice shows that when the action description graph of a detection video is obtained by the first algorithm, the computer device performs weighted summation on the feature vectors of the reference image frames using the exact importance scores of the reference image frames extracted from the detection video, whereas when the action description graph is obtained by the second algorithm, the computer device performs weighted summation on the reference image frames themselves using the approximate scores of their importance scores. The action description graph obtained by the first algorithm therefore describes the action of the target object more accurately than the one obtained by the second algorithm, but obtaining the action description graph of a detection video by the first algorithm is slower than obtaining it by the second algorithm. That is, the action discrimination result for the target object obtained by recognizing the action description graph produced by the first algorithm is more accurate than the result obtained by recognizing the action description graph produced by the second algorithm. After obtaining a plurality of detection videos of the target object, when generating the corresponding action description graphs the computer device may therefore select the first algorithm or the second algorithm according to the accuracy requirement (or the computation requirement) of the current action determination for the target object: it may use the second algorithm when the accuracy requirement is low but the requirement on computation speed is high, and may select the first algorithm when the accuracy requirement is high but the requirement on speed is relatively low.
The derivation of the first expression of the action description graph shown in formula 3 is explained below. Since a video (such as the above detection video) can be represented by a time sequence of still images, the detection video can be represented by a ranking function over its frame sequence ⟨I_1, I_2, …, I_T⟩, where I_i denotes the i-th image frame of the detection video (the frame at display time i), and the frame sequence represents the one or more reference image frames extracted from the detection video by image frame extraction processing when target action detection is performed on the target object based on that detection video. In addition, the computer device may use ψ(I_t) ∈ R^D to denote the feature vector extracted from each individual frame I_t of the video, where R denotes the set of real numbers. When determining the first expression of the action description graph, the computer device may determine the theoretical expression of the action description graph based on this representation of the reference image frames and their feature vectors, and then derive the first expression of the action description graph from the theoretical expression.
In an embodiment, when deriving the first expression of the action description map, the computer device may first perform time-averaging processing on any reference image frame according to the display time of that reference image frame, so as to obtain the conversion relationship between any reference image frame and the corresponding feature vector; the conversion relationship obtained by the computer device may be as shown in formula 6.

$$V_t = \frac{1}{t}\sum_{\tau=1}^{t} \psi(I_\tau)$$

Formula 6

where $V_t$ denotes the result of performing time-averaging processing on the reference image frames up to display time $t$, and $\psi(I_t)$ denotes the feature vector of the original reference image frame $I_t$.
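For illustration only, the time-averaging relationship in formula 6 can be sketched with NumPy, assuming the per-frame feature vectors $\psi(I_t)$ are available as the rows of a T×D array; the function name is ours and not part of this application.

```python
import numpy as np

def time_averaged_features(frame_features: np.ndarray) -> np.ndarray:
    """Compute V_t = (1/t) * sum_{tau<=t} psi(I_tau) for t = 1..T.

    frame_features: array of shape (T, D), row t-1 holding psi(I_t).
    Returns an array of shape (T, D) whose row t-1 holds V_t.
    """
    T = frame_features.shape[0]
    cumulative = np.cumsum(frame_features, axis=0)        # running sums over time
    counts = np.arange(1, T + 1).reshape(-1, 1)           # divisor t for each row
    return cumulative / counts
```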
After the computer device obtains the conversion relationship, since one action description map is used to record the action information performed by the target object in the corresponding detection video at the corresponding shooting angle, and since an action is associated with its time sequence (or time), the action information of the same action at different points in the time sequence is necessarily different; that is, the temporal relationship between the image frames in the detection video affects the action information recorded in the resulting action description map. Therefore, after obtaining the conversion relationship between the reference image frames and the corresponding feature vectors, the computer device may construct a constraint function for the action description map based on the timing relationship between the one or more reference image frames, the constraint function indicating: if the display time of one reference image frame in any detection video is later than that of another reference image frame in that detection video, then, in the process of constructing the action description map corresponding to that detection video, the score corresponding to the one reference image frame is greater than the score corresponding to the other reference image frame. The constraint function can then be solved to obtain the theoretical expression of the action description map. In one embodiment, the constraint function constructed by the computer device based on the timing relationship between the one or more reference image frames may be
$$\forall\, q > t:\quad S(q \mid d) > S(t \mid d)$$

Since the action description map that the computer device needs to obtain should extract the dynamic, time-ordered features of the detection video, in order to introduce this dynamic information the constraint function is equivalent to formula 7.

$$d^{*} = \operatorname*{argmin}_{d}\; \lVert d\rVert^{2} \quad \text{s.t.}\quad \forall\, q > t:\; S(q \mid d) \ge S(t \mid d) + 1$$

Formula 7
where s.t. means "such that" and introduces the condition that q is later than t in time, $S(q \mid d)$ denotes the score corresponding to the reference image frame with display time q during construction of the action description map (i.e. the map indicated by the parameter d in formula 7 above), and $S(t \mid d)$ denotes the score corresponding to the reference image frame with display time t during construction of the action description map. It can be seen that the theoretical expression of the action description map can be obtained by solving the constraint shown in formula 7. In an embodiment, the computer device may solve for the theoretical expression d of the action description map by the RankSVM method (a machine learning method); the resulting theoretical expression of the action description map is shown in formula 8.

$$d^{*} = \operatorname*{argmin}_{d}\; E(d), \qquad E(d) = \frac{\lambda}{2}\lVert d\rVert^{2} + l(d)$$

Formula 8

where $\lVert\cdot\rVert$ denotes the norm, and $l(d)$ can be calculated by formula 9.

$$l(d) = \frac{2}{T(T-1)} \sum_{q>t} \max\bigl\{0,\; 1 - S(q \mid d) + S(t \mid d)\bigr\}$$

Formula 9
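For concreteness, the following small sketch evaluates an objective of the form of formulas 8 and 9 (a regularizer plus a pairwise hinge loss over frame pairs with q later than t), assuming the score $S(t \mid d)$ is the inner product of d with the time-averaged feature $V_t$; the scoring assumption and the function name are ours, and in practice the minimization would be handed to an off-the-shelf RankSVM solver.

```python
import numpy as np

def ranking_objective(d: np.ndarray, V: np.ndarray, lam: float = 1.0) -> float:
    """E(d) = lam/2 * ||d||^2 + 2/(T(T-1)) * sum_{q>t} max(0, 1 - S(q|d) + S(t|d)),
    with the assumed scoring S(t|d) = <d, V_t>.  V has shape (T, D), T >= 2."""
    T = V.shape[0]
    scores = V @ d                                  # S(t|d) for t = 1..T
    hinge = 0.0
    for t in range(T):
        for q in range(t + 1, T):                   # frame q has a later display time
            hinge += max(0.0, 1.0 - scores[q] + scores[t])
    return 0.5 * lam * float(d @ d) + 2.0 / (T * (T - 1)) * hinge
```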
The derivation process described above may be referred to as a rank pooling (sort pooling) operation: it allows the computer device to map a sequence of T reference image frames to a feature vector $d^{*}$, and the action description map of the detection video to which the reference image frames belong can then be obtained by performing restoration and reconstruction processing on the feature vector obtained by this mapping. Since obtaining the action description map directly from its theoretical expression consumes a large amount of the computing resources of the computer device, the computer device may wish to increase the speed of obtaining the dynamic map so as to meet the real-time requirement of action determination for the target object. Therefore, after obtaining the theoretical expression of the action description map, the computer device may perform an approximate derivation on the basis of the theoretical expression, so that the first expression for calculating the action description map can be determined from the approximate expression obtained by this approximate derivation. In a specific implementation, when determining the first expression of the action description map, the computer device may first obtain the theoretical expression of the action description map, and perform a one-step approximate derivation on the theoretical expression to obtain an approximate expression of the action description map, where the approximate expression is used to indicate the gradient change of the pixels of the action description map. When performing the one-step approximate derivation based on the theoretical expression, the computer device may start the process from the derivative of the objective at $d = \vec{0}$, i.e.

$$\nabla E(d)\big|_{d=\vec{0}}$$

and the approximate expression obtained by the one-step approximate derivation can therefore be expressed as formula 10.

$$d^{*} \approx \vec{0} - \eta\, \nabla E(d)\big|_{d=\vec{0}}$$

Formula 10
where $\eta$ is an arbitrarily small value greater than 0, and $\nabla E(d)\big|_{d=\vec{0}}$ represents the gradient change of the corresponding pixels in the action description map. Since the pixel values of the corresponding pixels in the action description map record the action information performed by the target object in the reference image frames, after obtaining the approximate expression of the action description map the computer device may determine, from the one or more reference image frames and based on the display time corresponding to each reference image frame, the reference image frames whose display times have a precedence relationship, and calculate the image difference values between those reference image frames. After the image difference values between the reference image frames whose display times have a precedence relationship are determined, these image difference values can be used to replace the gradient change of the corresponding pixels in the approximate expression of the action description map, so that the first expression of the action description map can be determined from the expression obtained after the replacement. The expression of the image difference between the reference image frames whose display times have a precedence relationship satisfies formula 11.
$$-\,\nabla E(d)\big|_{d=\vec{0}} \;\propto\; \sum_{q>t}\bigl(V_q - V_t\bigr)$$

Formula 11

where $\propto$ is the direct-proportion operator. Formula 11 indicates that the gradient change of the pixels in the action description map is directly proportional to the sum of the image differences between those reference image frames, among the one or more reference image frames, whose corresponding display times have a precedence relationship. The computer device may then use the expression of the image difference shown in formula 11 to perform a similar replacement on the approximate expression of the action description map, thereby obtaining the first expression of the action description map.
In one embodiment, when the computer device performs the similar replacement on the gradient change represented in the approximate expression by using the expression of the image difference to obtain the first expression of the action description map, it may first perform the similar replacement to obtain an equivalent expression of the action description map, where the equivalent expression represents the action description map obtained by performing weighted summation processing on each reference image frame. The equivalent expression may be as shown in formula 12.

$$d^{*} \;\propto\; \sum_{t=1}^{T} \beta_t\, V_t$$

Formula 12

where $\beta_t$ is a weighting coefficient that depends only on the frame index t and the total number T of frames; in the standard approximate rank-pooling derivation it takes the form $\beta_t = 2t - T - 1$. Then, after obtaining the equivalent expression, the computer device may further perform equation conversion on the equivalent expression based on the conversion relationship between any reference image frame and the corresponding feature vector shown in formula 6, so as to obtain the first expression of the action description map.
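As an illustrative sketch only: if the equation conversion follows the standard approximate rank-pooling derivation, the first expression reduces to a weighted sum $d^{*} \propto \sum_t \alpha_t\, \psi(I_t)$ with $\alpha_t = 2(T - t + 1) - (T + 1)(H_T - H_{t-1})$, where $H_j$ is the j-th harmonic number. Whether this is exactly the first expression (equation 3) of this application is an assumption on our part; the restoration and reconstruction step then reshapes $d^{*}$ back into image form.

```python
import numpy as np

def approximate_rank_pooling(frame_features: np.ndarray) -> np.ndarray:
    """Weighted sum over per-frame features approximating the rank-pooling solution.

    frame_features: (T, D) array, row t-1 holding psi(I_t).
    Returns a single D-dimensional vector d* (the action description map before
    it is reshaped back into image form).
    """
    T = frame_features.shape[0]
    # harmonic[j] = H_j, with H_0 = 0.
    harmonic = np.concatenate(([0.0], np.cumsum(1.0 / np.arange(1, T + 1))))
    t = np.arange(1, T + 1)
    # Assumed coefficients of the first expression (standard approximate rank pooling).
    alpha = 2.0 * (T - t + 1) - (T + 1) * (harmonic[T] - harmonic[t - 1])
    return alpha @ frame_features
```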
S405, performing information fusion processing on the action information recorded in each action description map to obtain an action fusion description map for describing the action information of the target object at a plurality of different shooting angles.
S406, identifying the action executed by the target object according to the action fusion description graph to obtain an action identification result aiming at the target object, wherein the action identification result is used for representing whether the target object executes the target action.
In step S405 and step S406, after the action description map corresponding to any one of the collected detection videos is obtained, the computer device may process the action description map of that detection video by calling one branch network of the multi-branch convolutional neural network, and transmit the processed result to the backbone network to obtain the action fusion description map of the target object. In one embodiment, when obtaining the action description map corresponding to any detection video, the computer device may input that detection video into one or more branch networks of the multi-branch convolutional neural network, so as to obtain one or more action description maps of that detection video: after a detection video is input into one branch network, that branch network outputs one action description map for the detection video, and if the computer device inputs one detection video into a plurality of branch networks, a plurality of action description maps of that detection video are obtained from the plurality of branch networks. That is, if the number of detection videos acquired by the computer device is N, the number M of action description maps obtained through the multi-branch convolutional neural network is necessarily greater than or equal to N.
In an embodiment, if the computer device inputs a detection video into a plurality of different branch networks and thereby obtains a plurality of action description maps of that detection video, the computer device may select one of the obtained action description maps for the subsequent information fusion processing, may select one according to a specific selection rule, or may use all of the obtained action description maps for the subsequent information fusion processing. After the one or more action description maps corresponding to each detection video are obtained, the embodiment of the present application does not limit the number of action description maps of each detection video used for the subsequent fusion, nor the manner in which they are determined. However, the embodiment of the present application mainly describes the case where each acquired detection video is input into one branch network to obtain one corresponding action description map, that is, the case where the number of detection videos acquired by the computer device is N and, after the image frame extraction processing and the weighted fusion processing are performed on the N detection videos, the number M of obtained action description maps is equal to N.
In one embodiment, since the multi-branch convolutional neural network further includes a backbone network, when the action description maps corresponding to the detection videos have been obtained and the action information recorded in each action description map is to be fused, the computer device may call any branch network to extract the action information recorded in any one of the M action description maps, so as to obtain the action information of that action description map at the corresponding shooting angle; further, the computer device may use the backbone network to perform information fusion processing on the action information extracted from each action description map at the corresponding shooting angle, so as to obtain the action fusion description map for describing the action information of the target object at a plurality of different shooting angles. As shown in fig. 5c, each branch of the multi-branch convolutional neural network may process the detection video from one image capture device; that is, the computer device may use shallow convolutions to extract the low-frequency signal in the feature image, fuse the visual features processed by each branch network by means of a pooling operation, and input the fused visual features into the backbone network, so that the backbone network can determine whether the target object performs the target action based on the fused action information (i.e., the action fusion description map).
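A rough, non-authoritative sketch of a multi-branch structure of the kind described here is given below; the channel counts, layer depths, input resolution, and the averaging used for fusion are assumptions, since the application does not specify them.

```python
import torch
import torch.nn as nn

class MultiBranchActionNet(nn.Module):
    """Illustrative multi-branch network: one shallow branch per camera angle,
    feature fusion by pooling, then a backbone that outputs a target-action score."""

    def __init__(self, num_branches: int = 3, in_channels: int = 3):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(                      # shallow convolution per view
                nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
            )
            for _ in range(num_branches)
        ])
        self.backbone = nn.Sequential(          # fused features -> action decision
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, 2),                   # target action vs. no target action
        )

    def forward(self, description_maps):
        # description_maps: list of N tensors, each (B, C, H, W), one per view,
        # assumed to share the same spatial resolution.
        feats = [branch(x) for branch, x in zip(self.branches, description_maps)]
        fused = torch.stack(feats, dim=0).mean(dim=0)   # simple pooling-style fusion
        return self.backbone(fused)
```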
In one embodiment, the multi-branch convolutional neural network is a trained deep learning network. Since frames containing the target action may account for only a small proportion of the extracted reference image frames, a class-weighted focal loss function may be used when training the multi-branch convolutional neural network, thereby improving the accuracy of the trained model in action determination.
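A minimal sketch of a class-weighted focal loss for the binary case described above follows; the focusing parameter gamma and the per-class weights are placeholders, as the application does not state the values it uses.

```python
import torch
import torch.nn.functional as F

def class_weighted_focal_loss(logits, targets, class_weights, gamma: float = 2.0):
    """logits: (B, 2) raw scores; targets: (B,) with 0/1 labels;
    class_weights: tensor of shape (2,) weighting each class (placeholder values)."""
    log_probs = F.log_softmax(logits, dim=1)
    probs = log_probs.exp()
    pt = probs.gather(1, targets.unsqueeze(1)).squeeze(1)       # p of the true class
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    weights = class_weights[targets]                            # per-sample class weight
    loss = -weights * (1.0 - pt) ** gamma * log_pt              # focal modulation
    return loss.mean()
```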
For the binary classification problem of whether the target object in a detection video performs the target action, two common metrics, specificity (TN/(TN + FP)) and sensitivity (TP/(TP + FN)), can be used to evaluate the determination result. Specificity is the proportion of correctly classified samples among all negative examples and measures the classifier's ability to recognize negative examples; sensitivity is the proportion of correctly classified samples among all positive examples and measures the classifier's ability to recognize positive examples; for both metrics, larger is better. As shown in fig. 5d, the performance of target action detection on the target object based on this scheme is optimal compared with other schemes. After performing target action detection on the target object, the computer device may also output the detection result as shown in fig. 5e: based on the determination of whether the target object performs the target action, the computer device may label the target object with a solid rectangular box as shown in fig. 5e when it determines that the target object performs the target action, and label the target object with a dashed rectangular box as shown in fig. 5e when it determines that the target object does not perform the target action.
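The two evaluation metrics mentioned above follow directly from the confusion-matrix counts; a small self-contained sketch:

```python
def specificity_sensitivity(tp: int, fp: int, tn: int, fn: int):
    """Specificity = TN / (TN + FP): ability to recognize negative examples.
    Sensitivity = TP / (TP + FN): ability to recognize positive examples."""
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    return specificity, sensitivity

# Example with hypothetical counts: 90 TN, 10 FP, 80 TP, 20 FN.
print(specificity_sensitivity(tp=80, fp=10, tn=90, fn=20))   # (0.9, 0.8)
```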
In the embodiment of the present application, in the process of performing target action detection on the target object, the computer device can acquire a plurality of detection videos of the target object at different angles from different image acquisition devices, so that the computer device can obtain image information captured from multiple viewing angles of the target object. This effectively avoids detection errors caused by occlusion in the subsequent action detection process, reduces the generation of image shadows, and allows the computer device to still perform action determination for the target object based on the detection videos acquired by the remaining normal image acquisition devices when some image acquisition devices fail. After acquiring the plurality of detection videos for the target object, the computer device may perform image frame extraction on any detection video to obtain one or more corresponding reference image frames, determine the importance score corresponding to each reference image frame, and perform weighted fusion processing on the corresponding reference image frames using those importance scores to obtain the action description map of that detection video. Further, the computer device may obtain, based on information fusion processing of the action information in the plurality of action description maps, an action fusion description map for describing the action information of the target object at a plurality of different angles, and may then recognize the action fusion description map to obtain the action determination result for the target object. Based on the fusion of the plurality of action description maps, the computer device obtains action information of the target object at a plurality of shooting angles, so that the accuracy of action recognition by the computer device can be effectively improved through the recognition processing of the action information at multiple shooting angles.
Based on the description of the embodiment of the motion recognition method based on multi-angle video, the embodiment of the present invention further provides a motion recognition apparatus, which may be a computer program (including program code) running in the computer device. The motion recognition apparatus may be used to execute the motion recognition method described above with reference to fig. 2 and fig. 4. Referring to fig. 6, the motion recognition apparatus includes: an acquisition unit 601, a processing unit 602 and a recognition unit 603.
An obtaining unit 601, configured to obtain N detection videos obtained by shooting a target object at multiple different shooting angles, where N is a positive integer greater than or equal to 2;
a processing unit 602, configured to perform image frame extraction processing and weighted fusion processing on the N detection videos to obtain M action description maps, where M is a positive integer, and one action description map is used to record action information of a target object in a detection video executed at a corresponding shooting angle;
the processing unit 602 is further configured to perform information fusion processing on the motion information recorded in each motion description map to obtain a motion fusion description map used for describing motion information of the target object at multiple different shooting angles;
the identifying unit 603 is configured to perform identification processing on the action performed by the target object according to the action fusion description map, so as to obtain an action identification result for the target object, where the action identification result is used to represent whether the target object performs a target action.
In one embodiment, any action description graph is obtained by calling a branch network of a multi-branch convolutional neural network to perform image frame extraction processing and weighting fusion processing on any detection video, and the multi-branch convolutional neural network further comprises a main network; the processing unit 602 is further configured to:
calling any branch network to extract the action information recorded by any action description diagram in the M action description diagrams to obtain the action information of any action description diagram at the corresponding shooting angle;
and calling the backbone network to perform information fusion processing on the action information extracted from each action description diagram under the corresponding shooting angle to obtain an action fusion description diagram for describing the action information of the target object at a plurality of different shooting angles.
In one embodiment, an action description map is obtained by performing image frame extraction processing and weighted fusion processing on a detection video;
the processing unit 602 is further configured to perform image frame extraction processing on any detected video to obtain one or more reference image frames of the any detected video;
the obtaining unit 601 is further configured to obtain an importance score corresponding to any reference image frame, where the importance score is used to represent the importance of the corresponding reference image frame;
the processing unit 602 is further configured to perform weighted summation on the corresponding reference image frame by using the importance score, so as to obtain an action description map corresponding to any one of the detection videos.
In an embodiment, the processing unit 602 is specifically configured to:
performing image frame extraction processing on any detection video based on the display sequence of each image frame in any detection video to obtain a representation sequence corresponding to any detection video, and taking the image frame in the representation sequence as a reference image frame, or;
and acquiring a sliding window, carrying out image frame extraction processing on any detection video based on the sliding window, and taking the image frame in the sliding window as a reference image frame.
In an embodiment, the obtaining unit 601 is specifically configured to:
acquiring a time sequence for representing the display time of any reference image frame in the one or more reference image frames in the corresponding detection video, wherein the time sequence comprises a display time corresponding to one reference image frame;
on the basis of the time sequence, carrying out harmonic processing on the display time of any reference image frame to obtain the time importance degree used for representing the corresponding display time in action description;
and taking the time importance as an importance score of any reference image frame corresponding to the corresponding display time.
In an embodiment, the processing unit 602 is specifically configured to:
if the action description map is obtained by adopting a first algorithm, acquiring a feature vector of each reference image frame in the one or more reference image frames;
weighting and summing corresponding characteristic vectors by adopting the importance degree scores of any reference image frame to obtain a characterization vector of any detection video;
and restoring and reconstructing the characterization vector to obtain an action description diagram of any one detection video.
In one embodiment, the first algorithm is represented by a first expression; the obtaining unit 601 is further configured to obtain a theoretical expression of the action description graph, and perform one-step approximate derivation processing on the theoretical expression to obtain an approximate expression of the action description graph, where the approximate expression is used to represent a gradient change of a corresponding pixel change in the action description graph;
the processing unit 602 is further configured to determine, based on the display time corresponding to each reference image frame in the one or more reference image frames, a reference image frame with a precedence relationship corresponding to the display time from the one or more reference image frames, and calculate an image difference between the reference image frames with the precedence relationship corresponding to the display time;
the processing unit 602 is further configured to perform similar replacement on the gradient change represented in the approximate expression by using the expression of the image difference value, and determine the first expression of the action description graph according to the expression after the similar replacement.
In an embodiment, the processing unit 602 is specifically configured to:
performing similar replacement on the gradient change represented in the approximate expression by adopting the expression of the image difference value to obtain an equivalent expression of the action description graph, wherein the equivalent expression is used for representing and performing weighted summation processing on each reference image frame to obtain the action description graph;
and carrying out equation conversion on the equivalent expression based on the conversion relation between any reference image frame and the corresponding feature vector to obtain a first expression of the action description diagram.
In an embodiment, the obtaining unit 601 is specifically configured to:
according to the display time of any reference image frame, carrying out time averaging processing on any reference image frame to obtain a conversion relation between any reference image frame and a corresponding feature vector;
constructing a constraint function for an action description graph based on a timing relationship between image frames in the one or more reference image frames, the constraint function indicating: if the display time of one reference image frame in any detection video is later than that of another reference image frame in any detection video, in the process of constructing the action description map corresponding to any detection video, the corresponding score of the one reference image frame is greater than that of the other reference image frame;
and solving the constraint function to obtain a theoretical expression of the action description diagram.
In an embodiment, the processing unit 602 is specifically configured to:
if an action description graph is obtained by adopting a second algorithm, performing similarity conversion on the importance scores to obtain similarity scores of the importance scores;
and weighting and summing the corresponding reference image frames by adopting the similarity scores of the corresponding importance scores of any reference image frame to obtain an action description diagram of any detection video.
In the embodiment of the present application, in the process of performing action recognition on a target object to determine whether the action performed by the target object is the target action, the obtaining unit 601 may first obtain a plurality of detection videos obtained by capturing the target object at different angles. The processing unit 602 may then perform image frame extraction processing and weighted fusion processing on each of the obtained detection videos to obtain a plurality of action description maps; since each action description map records the action information performed by the target object at the corresponding shooting angle, the processing unit 602 may further perform information fusion processing on the obtained action description maps to obtain the action fusion description map of the target object. The recognition unit 603 thus has access to the action information of the target object at different shooting angles and, based on its recognition processing of the action fusion description map, can recognize the actions of the target object at multiple angles, thereby improving the accuracy and confidence of the action discrimination for the target object.
Fig. 7 is a schematic block diagram of a computer device according to an embodiment of the present invention. The computer device in the present embodiment shown in fig. 7 may include: one or more processors 701; one or more input devices 702, one or more output devices 703, and memory 704. The processor 701, the input device 702, the output device 703, and the memory 704 are connected by a bus 705. The memory 704 is used to store a computer program comprising program instructions, and the processor 701 is used to execute the program instructions stored by the memory 704.
The memory 704 may include volatile memory (volatile memory), such as random-access memory (RAM); the memory 704 may also include a non-volatile memory (non-volatile memory), such as a flash memory (flash memory), a solid-state drive (SSD), etc.; the memory 704 may also comprise a combination of the above types of memory.
The processor 701 may be a Central Processing Unit (CPU). The processor 701 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a Programmable Logic Device (PLD), or the like. The PLD may be a field-programmable gate array (FPGA), a General Array Logic (GAL), or the like. The processor 701 may also be a combination of the above structures.
In an embodiment of the present invention, the memory 704 is configured to store a computer program, the computer program includes program instructions, and the processor 701 is configured to execute the program instructions stored in the memory 704, so as to implement the steps of the corresponding methods as described above in fig. 2 and fig. 4.
In one embodiment, the processor 701 is configured to call the program instructions to perform:
acquiring N detection videos shot by a target object at a plurality of different shooting angles, wherein N is a positive integer greater than or equal to 2;
performing image frame extraction processing and weighted fusion processing on the N detection videos to obtain M action description graphs, wherein M is a positive integer, and one action description graph is used for recording action information executed by a target object in the detection videos at a corresponding shooting angle;
performing information fusion processing on the action information recorded by each action description diagram to obtain an action fusion description diagram used for describing the action information of the target object at a plurality of different shooting angles;
and identifying the action executed by the target object according to the action fusion description graph to obtain an action identification result aiming at the target object, wherein the action identification result is used for representing whether the target object executes the target action.
In one embodiment, any action description graph is obtained by calling a branch network of a multi-branch convolutional neural network to perform image frame extraction processing and weighting fusion processing on any detection video, and the multi-branch convolutional neural network further comprises a main network; the processor 701 is configured to call the program instructions for performing:
calling any branch network to extract the action information recorded by any action description diagram in the M action description diagrams to obtain the action information of any action description diagram at the corresponding shooting angle;
and calling the backbone network to perform information fusion processing on the action information extracted from each action description diagram under the corresponding shooting angle to obtain an action fusion description diagram for describing the action information of the target object at a plurality of different shooting angles.
In one embodiment, an action description map is obtained by performing image frame extraction processing and weighted fusion processing on a detection video; the processor 701 is configured to call the program instructions for performing:
carrying out image frame extraction processing on any detection video to obtain one or more reference image frames of any detection video;
acquiring an importance score corresponding to any reference image frame, wherein the importance score is used for representing the importance of the corresponding reference image frame;
and carrying out weighted summation on the corresponding reference image frames by adopting the importance degree scores to obtain an action description diagram corresponding to any detection video.
In one embodiment, the processor 701 is configured to call the program instructions to perform:
performing image frame extraction processing on any detection video based on the display sequence of each image frame in any detection video to obtain a representation sequence corresponding to any detection video, and taking the image frame in the representation sequence as a reference image frame, or;
and acquiring a sliding window, carrying out image frame extraction processing on any detection video based on the sliding window, and taking the image frame in the sliding window as a reference image frame.
In one embodiment, the processor 701 is configured to call the program instructions to perform:
acquiring a time sequence for representing the display time of any reference image frame in the one or more reference image frames in the corresponding detection video, wherein the time sequence comprises a display time corresponding to one reference image frame;
on the basis of the time sequence, carrying out harmonic processing on the display time of any reference image frame to obtain the time importance degree used for representing the corresponding display time in action description;
and taking the time importance as an importance score of any reference image frame corresponding to the corresponding display time.
In one embodiment, the processor 701 is configured to call the program instructions to perform:
if the action description map is obtained by adopting a first algorithm, acquiring a feature vector of each reference image frame in the one or more reference image frames;
weighting and summing corresponding characteristic vectors by adopting the importance degree scores of any reference image frame to obtain a characterization vector of any detection video;
and restoring and reconstructing the characterization vector to obtain an action description diagram of any one detection video.
In one embodiment, the first algorithm is represented by a first expression; the processor 701 is configured to call the program instructions for performing:
obtaining a theoretical expression of an action description graph, and performing one-step approximate derivation processing on the theoretical expression to obtain an approximate expression of the action description graph, wherein the approximate expression is used for representing gradient change of corresponding pixel change in the action description graph;
determining reference image frames with precedence relation corresponding to the display time from the one or more reference image frames based on the display time corresponding to each reference image frame in the one or more reference image frames, and calculating image difference values between the reference image frames with precedence relation corresponding to the display time;
and performing similar replacement on the gradient change represented in the approximate expression by adopting the expression of the image difference value, and determining a first expression of the action description graph according to the expression after the similar replacement.
In one embodiment, the processor 701 is configured to call the program instructions to perform:
performing similar replacement on the gradient change represented in the approximate expression by adopting the expression of the image difference value to obtain an equivalent expression of the action description graph, wherein the equivalent expression is used for representing and performing weighted summation processing on each reference image frame to obtain the action description graph;
and carrying out equation conversion on the equivalent expression based on the conversion relation between any reference image frame and the corresponding feature vector to obtain a first expression of the action description diagram.
In one embodiment, the processor 701 is configured to call the program instructions to perform:
according to the display time of any reference image frame, carrying out time averaging processing on any reference image frame to obtain a conversion relation between any reference image frame and a corresponding feature vector;
constructing a constraint function for an action description graph based on a timing relationship between image frames in the one or more reference image frames, the constraint function indicating: if the display time of one reference image frame in any detection video is later than that of another reference image frame in any detection video, in the process of constructing the action description map corresponding to any detection video, the corresponding score of the one reference image frame is greater than that of the other reference image frame;
and solving the constraint function to obtain a theoretical expression of the action description diagram.
In one embodiment, the processor 701 is configured to call the program instructions to perform:
if an action description graph is obtained by adopting a second algorithm, performing similarity conversion on the importance scores to obtain similarity scores of the importance scores;
and weighting and summing the corresponding reference image frames by adopting the similarity scores of the corresponding importance scores of any reference image frame to obtain an action description diagram of any detection video.
Embodiments of the present invention provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method embodiments as shown in fig. 2 or fig. 4. The computer-readable storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
While the invention has been described with reference to a particular embodiment, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (12)

1. A motion recognition method based on multi-angle video is characterized by comprising the following steps:
acquiring N detection videos shot by a target object at a plurality of different shooting angles, wherein N is a positive integer greater than or equal to 2;
performing image frame extraction processing on any one of the N detection videos to obtain one or more reference image frames of the any one detection video, and performing weighted fusion processing on the corresponding reference image frames according to the importance score of any one reference image frame to obtain M action description graphs, wherein M is a positive integer, and one action description graph is used for recording action information executed by a target object in the detection video at a corresponding shooting angle;
performing information fusion processing on the action information recorded by each action description diagram to obtain an action fusion description diagram used for describing the action information of the target object at a plurality of different shooting angles;
identifying the action executed by the target object according to the action fusion description graph to obtain an action identification result aiming at the target object, wherein the action identification result is used for representing whether the target object executes the target action;
wherein, obtaining the importance degree score of any reference image frame comprises the following steps: acquiring a time sequence for representing the display time of any reference image frame in the one or more reference image frames in the corresponding detection video, wherein the time sequence comprises a display time corresponding to one reference image frame; on the basis of the time sequence, carrying out harmonic processing on the display time of any reference image frame to obtain the time importance degree used for representing the corresponding display time in action description; and taking the time importance as an importance score of any reference image frame corresponding to the corresponding display time.
2. The method of claim 1, wherein any motion profile is obtained by performing image frame extraction and weighted fusion on any detected video by calling a branch network of a multi-branch convolutional neural network, the multi-branch convolutional neural network further comprising a trunk network; the motion information recorded by each motion description map is subjected to information fusion processing to obtain a motion fusion description map used for describing the motion information of the target object at a plurality of different shooting angles, and the motion fusion description map comprises the following steps:
calling any branch network to extract the action information recorded by any action description diagram in the M action description diagrams to obtain the action information of any action description diagram at the corresponding shooting angle;
and calling the backbone network to perform information fusion processing on the action information extracted from each action description diagram under the corresponding shooting angle to obtain an action fusion description diagram for describing the action information of the target object at a plurality of different shooting angles.
3. The method of claim 1, wherein an action profile is obtained by performing image frame extraction processing and weighted fusion processing on a detected video; the importance scores are used to characterize the importance of the respective reference image frames.
4. The method as claimed in claim 3, wherein said performing image frame extraction processing on any detected video to obtain one or more reference image frames of said any detected video comprises:
performing image frame extraction processing on any detection video based on the display sequence of each image frame in any detection video to obtain a representation sequence corresponding to any detection video, and taking the image frame in the representation sequence as a reference image frame, or;
and acquiring a sliding window, carrying out image frame extraction processing on any detection video based on the sliding window, and taking the image frame in the sliding window as a reference image frame.
5. The method of claim 3, wherein the obtaining the action description map corresponding to any detected video by weighted summation of the corresponding reference image frames with the importance scores comprises:
if the first algorithm is adopted to obtain the action description diagram, the steps are as follows: obtaining a feature vector of each of the one or more reference image frames;
weighting and summing corresponding characteristic vectors by adopting the importance degree scores of any reference image frame to obtain a characterization vector of any detection video;
and restoring and reconstructing the characterization vector to obtain an action description diagram of any one detection video.
6. The method of claim 5, wherein the first algorithm is represented by a first expression; the method further comprises the following steps:
obtaining a theoretical expression of an action description graph, and performing one-step approximate derivation processing on the theoretical expression to obtain an approximate expression of the action description graph, wherein the approximate expression is used for representing gradient change of corresponding pixel change in the action description graph;
determining reference image frames with precedence relation corresponding to the display time from the one or more reference image frames based on the display time corresponding to each reference image frame in the one or more reference image frames, and calculating image difference values between the reference image frames with precedence relation corresponding to the display time;
and performing similar replacement on the gradient change represented in the approximate expression by adopting the expression of the image difference value, and determining a first expression of the action description graph according to the expression after the similar replacement.
7. The method of claim 6, wherein said similarly replacing the gradient changes characterized in the approximate expression with the expression for the image difference values and determining the first expression for the motion profile based on the similarly replaced expression comprises:
performing similar replacement on the gradient change represented in the approximate expression by adopting the expression of the image difference value to obtain an equivalent expression of the action description graph, wherein the equivalent expression is used for representing and performing weighted summation processing on each reference image frame to obtain the action description graph;
and carrying out equation conversion on the equivalent expression based on the conversion relation between any reference image frame and the corresponding feature vector to obtain a first expression of the action description diagram.
8. The method of claim 6, wherein obtaining a theoretical expression of an action description graph comprises:
according to the display time of any reference image frame, carrying out time averaging processing on any reference image frame to obtain a conversion relation between any reference image frame and a corresponding feature vector;
constructing a constraint function for an action description graph based on a timing relationship between image frames in the one or more reference image frames, the constraint function indicating: if the display time of one reference image frame in any detection video is later than that of another reference image frame in any detection video, in the process of constructing the action description map corresponding to any detection video, the corresponding score of the one reference image frame is greater than that of the other reference image frame;
and solving the constraint function to obtain a theoretical expression of the action description diagram.
9. The method as claimed in claim 1, wherein the performing of weighted fusion processing on the corresponding reference image frame according to the importance score of any reference image frame comprises:
if the action description graph is obtained by adopting a second algorithm, the steps are as follows: performing similarity conversion on the importance scores to obtain the similarity scores of the importance scores;
and weighting and summing the corresponding reference image frames by adopting the similarity scores of the corresponding importance scores of any reference image frame.
10. An action recognition device based on multi-angle video, comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring N detection videos obtained by shooting a target object at a plurality of different shooting angles, wherein N is a positive integer greater than or equal to 2;
the processing unit is used for performing image frame extraction processing on any one of the N detection videos to obtain one or more reference image frames of the any one detection video, and performing weighted fusion processing on the corresponding reference image frames according to the importance scores of the any reference image frames to obtain M action description graphs, wherein M is a positive integer, and one action description graph is used for recording action information executed by a target object in the detection video at a corresponding shooting angle;
the processing unit is further configured to perform information fusion processing on the motion information recorded in each motion description map to obtain a motion fusion description map used for describing the motion information of the target object at a plurality of different shooting angles;
the identification unit is used for identifying the action executed by the target object according to the action fusion description graph to obtain an action identification result aiming at the target object, and the action identification result is used for representing whether the target object executes the target action;
wherein, obtaining the importance degree score of any reference image frame comprises the following steps: acquiring a time sequence for representing the display time of any reference image frame in the one or more reference image frames in the corresponding detection video, wherein the time sequence comprises a display time corresponding to one reference image frame; on the basis of the time sequence, carrying out harmonic processing on the display time of any reference image frame to obtain the time importance degree used for representing the corresponding display time in action description; and taking the time importance as an importance score of any reference image frame corresponding to the corresponding display time.
11. A computer device comprising a processor, an input device, an output device and a memory, the processor, the input device, the output device and the memory being interconnected, wherein the memory is configured to store a computer program comprising program instructions, the processor being configured to invoke the program instructions to perform the method of any of claims 1 to 9.
12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to carry out the method according to any one of claims 1 to 9.
CN202111241878.0A 2021-10-25 2021-10-25 Multi-angle video-based action identification method and related equipment Active CN113688804B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111241878.0A CN113688804B (en) 2021-10-25 2021-10-25 Multi-angle video-based action identification method and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111241878.0A CN113688804B (en) 2021-10-25 2021-10-25 Multi-angle video-based action identification method and related equipment

Publications (2)

Publication Number Publication Date
CN113688804A CN113688804A (en) 2021-11-23
CN113688804B true CN113688804B (en) 2022-02-11

Family

ID=78587813

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111241878.0A Active CN113688804B (en) 2021-10-25 2021-10-25 Multi-angle video-based action identification method and related equipment

Country Status (1)

Country Link
CN (1) CN113688804B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115937743B (en) * 2022-12-09 2023-11-14 武汉星巡智能科技有限公司 Infant care behavior identification method, device and system based on image fusion
CN116152723B (en) * 2023-04-19 2023-06-27 深圳国辰智能系统有限公司 Intelligent video monitoring method and system based on big data

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109558781B (en) * 2018-08-02 2021-07-02 北京市商汤科技开发有限公司 Multi-view video identification method and device, equipment and storage medium
US10803336B2 (en) * 2018-08-08 2020-10-13 Google Llc Multi-angle object recognition
CN109711320B (en) * 2018-12-24 2021-05-11 兴唐通信科技有限公司 Method and system for detecting violation behaviors of staff on duty
CN111062356B (en) * 2019-12-26 2024-03-26 沈阳理工大学 Method for automatically identifying abnormal human body actions from monitoring video
CN112232190B (en) * 2020-10-15 2023-04-18 南京邮电大学 Method for detecting abnormal behaviors of old people facing home scene

Also Published As

Publication number Publication date
CN113688804A (en) 2021-11-23


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant