CN113743244B - Video human body accidental action positioning method and device based on counterfactual sample - Google Patents

Video human body accidental action positioning method and device based on counterfactual sample

Info

Publication number
CN113743244B
CN113743244B
Authority
CN
China
Prior art keywords
video
action
counterfactual
unexpected
prediction
Prior art date
Legal status
Active
Application number
CN202110931899.9A
Other languages
Chinese (zh)
Other versions
CN113743244A (en
Inventor
Jiwen Lu (鲁继文)
Jinglin Xu (徐婧林)
Jie Zhou (周杰)
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202110931899.9A priority Critical patent/CN113743244B/en
Publication of CN113743244A publication Critical patent/CN113743244A/en
Application granted granted Critical
Publication of CN113743244B publication Critical patent/CN113743244B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks

Abstract

The application provides a video human body accidental action positioning method based on counterfactual samples, which comprises the following steps: acquiring a human body unexpected action video as an original video; designing a causal model, wherein the causal model represents the relationship among the input video, the action content, the unexpected action and the model prediction; retrieving the intentional action part of the original video in a video pool to generate a counterfactual sample; performing feature extraction on the original video and the counterfactual sample to obtain a first spatiotemporal feature and a second spatiotemporal feature; inputting the first spatiotemporal feature into a standard LSTM network to identify the action intention of each video frame, and inputting the first and second spatiotemporal features into a twin LSTM network to learn the causal effect of the unexpected action of the original video on the model prediction and thereby the true causal relationship, so that the human body unexpected action in the original video is accurately localized. By decoupling the action content and the intention cues that are entangled in the video, the method and the device achieve accurate localization of human body unexpected actions in video.

Description

Video human body accidental action positioning method and device based on counterfactual sample
Technical Field
The application relates to the technical field of computer vision, in particular to a video human body accidental action positioning method and device based on counterfactual samples.
Background
When humans observe an action, they have a natural tendency to explain its purpose, i.e., to understand the intention behind the action. The intention of an action encompasses both the immediate result of the action and the higher-level motivation that causes it. For their own actions, humans typically evaluate whether the outcome matches their prior intention; this process is critical for judging action performance, learning from failures, and updating plans and goals. For the actions of others, humans typically judge whether those actions are intentional and infer the intention behind them. For example, when observing another person performing an action, a human observer typically obtains two pieces of information: how the target performs the action, and the intention with which the target performs it. Humans know from experience how an action should be completed; by observing the body language and behavior patterns of the target person, they construct a representation of the action, attribute mental and behavioral states to the target person, and understand what the target person is doing or wants to do. An action is considered intentional when its result is achieved in the intended way, and unexpected otherwise. Psychological studies have shown that humans are highly sensitive to discrepancies between the actual action outcome and the intention-directed action outcome. Human recognition of unexpected actions is not only an innate ability but also has significant survival value. Video action analysis and understanding technology especially needs this ability to recognize other people's plans and behavioral goals.
At present, existing computational models analyze and understand human actions from aspects such as visual motion and behavior patterns by performing representation learning on human action trajectories, speeds, and overall patterns in videos. However, these efforts mainly stay at visual recognition of action appearance and do not understand the underlying intention behind the action.
Disclosure of Invention
The present application is directed to solving, at least in part, one of the technical problems in the related art.
Therefore, a first objective of the present application is to provide a video human body unexpected action positioning method based on counterfactual samples, which solves the technical problem that existing methods mainly study visual recognition of action appearance, cannot understand the potential intention behind an action, and therefore cannot identify and localize human body unexpected actions in a video.
The second purpose of the present application is to provide a video human body unexpected motion positioning device based on counterfactual samples.
A third object of the present application is to propose a non-transitory computer-readable storage medium.
In order to achieve the above object, a first embodiment of the present application provides a method for locating human body unexpected actions based on counterfactual samples, including: acquiring a human body unexpected action video as an original video; designing a causal model, wherein the causal model represents the relationship among an input video, action content, an unexpected action and a model prediction, the action content is the action content in the original video, the unexpected action is the unexpected action contained in the original video, and the model prediction is the action intention prediction of the input video; generating a counterfactual sample by retrieving the intentional action part of the original video in a video pool; performing feature extraction on the original video and the counterfactual sample to obtain a first spatiotemporal feature and a second spatiotemporal feature, wherein the first spatiotemporal feature is the spatiotemporal feature of each video frame in the original video, and the second spatiotemporal feature is the spatiotemporal feature of each video frame in the counterfactual sample; inputting the first spatiotemporal feature into a standard LSTM network to identify the action intention of each video frame in the original video, and inputting the first spatiotemporal feature and the second spatiotemporal feature into a twin LSTM network to learn, by calculating the participant effect, the causal effect of the unexpected action of the original video on the model prediction and thereby the true causal relationship, so as to accurately localize the human body unexpected action in the original video, wherein the standard LSTM network and the twin LSTM network are many-to-many networks that share weights.
Optionally, in an embodiment of the present application, the causal model represents a relationship between input video, motion content, unexpected motion, model prediction, as:
U←X→C
U→Y←C
U←C→Y
wherein X is the input video, C is the action content, U is the unexpected action, and Y represents the model prediction; U ← X → C represents that a segment of input video simultaneously contains the action content C and the intention cue U; U → Y ← C represents that both U and C have an influence on the final action intention prediction Y, where the influence of U is the real causal relationship and the influence of C is a spurious association caused by training bias; and U ← C → Y represents that the action content C has an influence on both the unexpected action U and the action intention prediction Y.
Optionally, in an embodiment of the present application, the constructing of the counterfactual sample includes the following steps:
constructing an offline video pool to represent the knowledge that the action is successfully executed, wherein the offline video pool is obtained by searching according to keywords of action content in the original video;
and retrieving a suitable video from the video pool by an unsupervised retrieval method to construct the counterfactual sample, wherein the unsupervised retrieval method specifically comprises: filtering out a retrieved video if the distance between the original video and that video in the video pool is greater than a preset threshold, and otherwise keeping the video:
v_c = argmin_{v∈V} d(v, c), subject to d(v, c) ≤ τ
wherein the equation represents the process of retrieving, from the video pool V, a counterfactual video v_c having the same intentional action C = c, given a piece of original video consisting of an intentional action C = c followed by an unintentional action U = μ, where d(v, c) denotes the distance between a candidate video v and the intentional action c, and τ is the preset threshold;
and dividing the retrieved counterfactual video into the same intentional action and its subsequent intentional development, and generating the final counterfactual sample by splicing the same intentional action with the subsequent intentional development.
Optionally, in an embodiment of the present application, the action intention of each video frame in the original video is identified by inputting the hidden-layer feature sequence into a three-way classifier to classify each sequence element as intentional, transitional, or accidental.
Optionally, in one embodiment of the application, the calculation of the participant effect is performed by subtracting the counterfactual prediction from the original prediction, wherein the original prediction is expressed as:
Y_{U=μ} = P(Y | X(C, U))
wherein X(C, U) represents that the input video X contains the action content C and the intention cue U, Y represents the model prediction, Y_{U=μ} represents the original prediction, and U = μ represents the observed evidence that the unexpected action actually occurred;
the counterfactual prediction is expressed as:
Y_{U=μ_c} = P(Y | X(C, U=μ_c))
wherein U = μ_c represents the counterfactual sample, Y represents the model prediction, Y_{U=μ_c} represents the counterfactual prediction, X represents the input video, and C represents the action content;
the participant effect is expressed as:
ETT = E_{U=μ}[Y_{U=μ} - Y_{U=μ_c}]
where E denotes the expectation, U denotes the intention cue, U = μ denotes the observed evidence that the unexpected action actually occurred, Y_{U=μ} represents the original prediction, and Y_{U=μ_c} represents the counterfactual prediction.
In order to achieve the above object, a second aspect of the present application provides a video human body unexpected motion positioning apparatus based on counterfactual samples, including: an acquisition module, a design module, a generation module, an encoder network, a standard LSTM network, a twin LSTM network, wherein,
the acquisition module is used for acquiring a human body unexpected motion video as an original video;
the system comprises a design module and a model prediction module, wherein the design module is used for designing a causal model, the causal model is a relation among an input video, action content, an unexpected action and a model prediction, the action content is action content in an original video, the unexpected action is an unexpected action contained in the original video, and the model prediction is action intention prediction of the input video;
a generation module to generate a counterfactual sample by retrieving an intentional action portion of an original video in a video pool;
the encoder network is used for extracting the characteristics of the original video and the counterfactual samples to obtain a first space-time characteristic and a second space-time characteristic, wherein the first space-time characteristic is the space-time characteristic of each video frame in the original video, and the second space-time characteristic is the space-time characteristic of each video frame in the counterfactual samples;
the standard LSTM network is used for identifying action intention of each video frame in the original video according to the first time-space characteristics;
the twin LSTM network is used for learning the causal effect of the unexpected motion of the original video on model prediction by calculating the participant effect according to the first space-time feature and the second space-time feature, and further learning the real causal relationship of the unexpected motion of the original video on the model prediction, so that the accurate positioning of the human body unexpected motion in the original video is realized.
Optionally, in one embodiment of the present application, the standard LSTM network and the twin LSTM network are jointly optimized by a loss function, which is expressed as:
L = ETT + λ·L_CE = ETT + λ·E_X[log P(Y|X)]
wherein the cross-entropy loss function L_CE optimizes the standard LSTM network, the ETT loss term optimizes the twin LSTM network, the parameter λ balances the two loss terms, E_X denotes the expectation over all input videos X, and log P(Y|X) denotes the natural logarithm of the prediction probability.
In order to achieve the above object, a third aspect of the present application provides a non-transitory computer-readable storage medium; when the instructions in the storage medium are executed by a processor, the video human body unexpected action positioning method based on counterfactual samples can be performed.
The video human body unexpected action positioning method based on counterfactual samples, the video human body unexpected action positioning device based on counterfactual samples, and the non-transitory computer-readable storage medium of the present application solve the technical problems that existing methods mainly study visual recognition of action appearance, cannot understand the potential intention behind actions, and cannot identify and localize human body unexpected actions in a video. The counterfactual-sample-based unexpected action localization method decouples the action content and intention cues entangled in the video, alleviates the negative effect of biased action content, and calculates the causal effect of the intention cue on the model prediction, thereby accurately localizing human body unexpected actions in the video.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The above and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a flowchart of a method for locating human body unexpected actions based on counterfactual samples according to an embodiment of the present application;
FIG. 2 is another flowchart of a method for locating human body unexpected actions based on counterfactual samples according to an embodiment of the present application;
FIG. 3 is a diagram illustrating a causal model of a method for locating human body unexpected actions based on counterfactual samples according to an embodiment of the present application;
FIG. 4 is a statistical data diagram of a video accidental motion data set of a video human accidental motion positioning method based on a counterfactual sample according to an embodiment of the present application;
FIG. 5 is an overall framework diagram of a method for locating human body unexpected actions based on counterfactual samples according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a video human body accidental movement positioning device based on a counterfactual sample according to a second embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.
The following describes a method and an apparatus for locating human body unexpected actions based on counterfactual samples according to embodiments of the present application with reference to the accompanying drawings.
Fig. 1 is a flowchart of a method for locating human body unexpected actions based on counterfactual samples according to an embodiment of the present application.
As shown in fig. 1, the method for locating human body unexpected actions based on counterfactual samples comprises the following steps:
step 101, acquiring a human body accidental action video as an original video;
step 102, designing a causal model, wherein the causal model represents the relationship among an input video, action content, an unexpected action and a model prediction, the action content is the action content in the original video, the unexpected action is the unexpected action contained in the original video, and the model prediction is the action intention prediction of the input video;
step 103, generating a counterfactual sample by retrieving the intentional action part of the original video in the video pool;
step 104, extracting features of the original video and the counterfactual sample to obtain a first spatiotemporal feature and a second spatiotemporal feature, wherein the first spatiotemporal feature is the spatiotemporal feature of each video frame in the original video, and the second spatiotemporal feature is the spatiotemporal feature of each video frame in the counterfactual sample;
step 105, inputting the first spatiotemporal feature into a standard LSTM network to identify the action intention of each video frame in the original video, inputting the first spatiotemporal feature and the second spatiotemporal feature into a twin LSTM network, learning the causal effect of the unexpected action of the original video on the model prediction by calculating the participant effect, and further learning the real causal relationship of the unexpected action of the original video on the model prediction, so as to accurately localize the human body unexpected action in the original video, wherein the standard LSTM network and the twin LSTM network are many-to-many networks and share weights.
According to the video human body unexpected action positioning method based on counterfactual samples, a human body unexpected action video is acquired as the original video; a causal model is designed, which represents the relationship among the input video, the action content, the unexpected action and the model prediction, wherein the action content is the action content in the original video, the unexpected action is the unexpected action contained in the original video, and the model prediction is the action intention prediction of the input video; a counterfactual sample is generated by retrieving the intentional action part of the original video in a video pool; features of the original video and the counterfactual sample are extracted to obtain a first spatiotemporal feature and a second spatiotemporal feature, which are the spatiotemporal features of each video frame in the original video and in the counterfactual sample, respectively; the first spatiotemporal feature is input into a standard LSTM network to identify the action intention of each video frame in the original video, and the first and second spatiotemporal features are input into a twin LSTM network to learn, by calculating the participant effect, the causal effect of the unexpected action of the original video on the model prediction and thereby the real causal relationship, so that the human body unexpected action in the original video is accurately localized, wherein the standard LSTM network and the twin LSTM network are many-to-many networks that share weights. In this way, the technical problems that existing methods mainly study visual recognition of action appearance, cannot understand the potential intention behind an action, and cannot identify and localize human body unexpected actions in a video are solved: the counterfactual-sample-based unexpected action localization method decouples the action content and intention cues entangled in the video, alleviates the negative influence of biased action content, and calculates the causal effect of the intention cue on the model prediction, thereby accurately localizing the human body unexpected action in the video.
The counterfactual-sample-based human body unexpected action positioning method is inspired by infants' understanding of action intention and mainly involves two key points. Key point 1: perform counterfactual intervention and construct, for the machine, the knowledge needed to understand action intention, so that the machine can imagine how the original unexpected action would have developed intentionally. Key point 2: analyze the action intention by comparing the original unexpected action with the counterfactual intentional action, and calculate the causal effect of the action intention on the model prediction to obtain an accurate localization result that is free from the negative influence of training bias.
Further, in the embodiment of the present application, the causal model represents the relationship between the input video, the motion content, the unexpected motion, and the model prediction, and is represented as:
U←X→C
U→Y←C
U←C→Y
wherein X is the input video, C is the action content, U is the unexpected action, and Y represents the model prediction; U ← X → C represents that a segment of input video simultaneously contains the action content C and the intention cue U; U → Y ← C represents that both U and C have an influence on the final action intention prediction Y, where the influence of U is the real causal relationship and the influence of C is a spurious association caused by training bias; and U ← C → Y represents that the action content C has an influence on both the unexpected action U and the action intention prediction Y.
Further, in the embodiment of the present application, the construction of the counterfactual sample includes the following steps:
constructing an offline video pool to represent the knowledge that the action is successfully executed, wherein the offline video pool is obtained by searching according to keywords of action content in the original video;
and retrieving a suitable video from the video pool by an unsupervised retrieval method to construct the counterfactual sample, wherein the unsupervised retrieval method specifically comprises: filtering out a retrieved video if the distance between the original video and that video in the video pool is greater than a preset threshold, and otherwise keeping the video:
v_c = argmin_{v∈V} d(v, c), subject to d(v, c) ≤ τ
wherein the equation represents the process of retrieving, from the video pool V, a counterfactual video v_c having the same intentional action C = c, given a piece of original video consisting of an intentional action C = c followed by an unintentional action U = μ, where d(v, c) denotes the distance between a candidate video v and the intentional action c, and τ is the preset threshold;
and dividing the retrieved counterfactual video into the same intentional action and its subsequent intentional development, and generating the final counterfactual sample by splicing the same intentional action with the subsequent intentional development.
By analyzing the causal model, the method proposes counterfactual intervention and decouples the influences of the action content and the intention cue on the model prediction, so as to reduce spurious correlations.
In the retrieval equation, d(v, c) calculates the distance between a video v in the video pool V and the intentional action c in the original video, and v_c represents the retrieved counterfactual video. Only videos for which a counterfactual sample is found participate in the training process that involves causal inference. The retrieval process is unsupervised and requires no additional annotation information. Videos whose content is similar to c are selected as counterfactual videos, and it is assumed that most of the retrieved videos contain intentional actions.
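The retrieval and splicing process can be sketched as follows. This is an illustrative sketch only: the feature extractor, the distance metric, the threshold value, and the helper names (extract_feature, split_at_intent_boundary) are assumptions rather than the patent's reference implementation.

```python
# Illustrative sketch of counterfactual-sample construction; not the patent's reference code.
# extract_feature and split_at_intent_boundary are assumed helpers supplied by the caller.
import numpy as np

def retrieve_counterfactual(intentional_part, video_pool, extract_feature, threshold=0.5):
    """Retrieve a counterfactual video v_c whose intentional action matches C = c."""
    query = extract_feature(intentional_part)           # clip-level feature of the intentional action c
    best_video, best_dist = None, np.inf
    for video in video_pool:
        dist = np.linalg.norm(extract_feature(video) - query)
        if dist > threshold:                             # filter: too far from the intentional action c
            continue
        if dist < best_dist:                             # keep the closest remaining candidate
            best_video, best_dist = video, dist
    return best_video                                    # None if every candidate was filtered out

def build_counterfactual_sample(intentional_part, retrieved_video, split_at_intent_boundary):
    """Splice the intentional action with the retrieved intentional development."""
    _same_intent, intentional_development = split_at_intent_boundary(retrieved_video)
    # Final counterfactual sample: intentional-action frames followed by intentional-development frames.
    return np.concatenate([intentional_part, intentional_development], axis=0)
```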
Further, in the embodiment of the application, the action intention of each video frame in the original video is identified by inputting the hidden-layer feature sequence into a three-way classifier to classify each sequence element as intentional, transitional, or accidental.
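A minimal PyTorch-style sketch of this per-frame three-way classification is given below; the hidden dimension, the class ordering, and the module name FrameIntentClassifier are assumptions, not details fixed by the patent.

```python
# Illustrative sketch (PyTorch); hidden size and class order are assumptions.
import torch.nn as nn

class FrameIntentClassifier(nn.Module):
    """Maps the LSTM hidden-state sequence to per-frame intentional / transitional / accidental scores."""
    def __init__(self, hidden_dim=512, num_classes=3):
        super().__init__()
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, hidden_seq):        # hidden_seq: (batch, time, hidden_dim) from the standard LSTM
        return self.fc(hidden_seq)        # logits: (batch, time, 3)

# Usage: labels = logits.argmax(dim=-1), e.g. 0 = intentional, 1 = transitional, 2 = accidental (assumed order).
```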
Further, in the embodiments of the present application, the calculation of the participant effect is performed by subtracting the counterfactual prediction from the original prediction, wherein the original prediction is expressed as:
Y_{U=μ} = P(Y | X(C, U))
wherein X(C, U) represents that the input video X contains the action content C and the intention cue U, Y represents the model prediction, Y_{U=μ} represents the original prediction, and U = μ represents the observed evidence that the unexpected action actually occurred;
the counterfactual prediction is expressed as:
Y_{U=μ_c} = P(Y | X(C, U=μ_c))
wherein U = μ_c represents the counterfactual sample, Y represents the model prediction, Y_{U=μ_c} represents the counterfactual prediction, X represents the input video, and C represents the action content;
the participant effect is expressed as:
ETT = E_{U=μ}[Y_{U=μ} - Y_{U=μ_c}]
where E denotes the expectation, U denotes the intention cue, U = μ denotes the observed evidence that the unexpected action actually occurred, Y_{U=μ} represents the original prediction, and Y_{U=μ_c} represents the counterfactual prediction.
When the action content is similar, the primary difference between the original video and the counterfactual sample is the intention; by maximizing the ETT of the intention, the negative effect of training bias is mitigated and the causal effect of the intention on the model prediction is emphasized.
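Under the same illustrative assumptions, the participant effect can be computed by subtracting the counterfactual prediction from the original prediction, roughly as in the sketch below; taking the mean over frames and batch as the expectation is an assumption.

```python
# Illustrative sketch; the reduction over frames and batch is an assumption.
import torch

def participant_effect(prob_original: torch.Tensor, prob_counterfactual: torch.Tensor) -> torch.Tensor:
    """ETT = E_{U=mu}[ Y_{U=mu} - Y_{U=mu_c} ].

    prob_original:       P(Y | X(C, U))       from the factual branch,        shape (batch, time)
    prob_counterfactual: P(Y | X(C, U=mu_c))  from the counterfactual branch, same shape
    """
    return (prob_original - prob_counterfactual).mean()
```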
Fig. 2 is another flowchart of a method for locating human body unexpected actions based on counterfactual samples according to an embodiment of the present application.
As shown in fig. 2, the method for locating human body unexpected actions based on counterfactual samples includes: for any given human body unexpected action video, generating a counterfactual sample by retrieving the intentional action part of the original video in a video pool, and simultaneously inputting the original video and the counterfactual sample into a 3D-ResNet network to extract video spatiotemporal features; then applying a standard LSTM to identify the action intention of each video frame, and using a twin LSTM to decompose the causal effects of the video action content and the intention cue on the model prediction by calculating the participant effect (effect of treatment on the treated, ETT), thereby learning the real causal relationship of the intention cue on the model prediction; the standard LSTM and the twin LSTM share weights.
Fig. 3 is a diagram illustrating a causal model of a video human body unexpected motion positioning method based on a counterfactual sample according to an embodiment of the present application.
As shown in fig. 3, the nodes in the figure represent the input video X, the action content C, the unexpected action U, and the action intention prediction Y, respectively. Taking a video of "a boy fell while skateboarding" as an example, C denotes skateboarding, U denotes falling, and Y denotes the model prediction (the prediction of action intention). The arrows in the figure represent the dependency between two variables: U ← X → C denotes that a piece of input video contains both the action content C and the intention cue U; U → Y ← C denotes that both U and C have an effect on the final action intention prediction Y, where the effect of U is the true causal relationship and the effect of C is a spurious association caused by training bias; U ← C → Y denotes that the action content C has an influence on both the unexpected action U and the action intention prediction Y.
Fig. 4 is a statistical data diagram of a video accidental motion data set of a video human accidental motion positioning method based on a counterfactual sample according to an embodiment of the present application.
As shown in FIG. 4, in the training set the unexpected action in "skateboarding" videos always occurs within [1.6, 4.2] seconds, while in "handstand" videos it occurs mainly within [2.1, 6.3] seconds. Thus, the true cause of the model prediction Y is likely to be erroneously attributed to the action content C rather than to the occurrence of the unexpected action U.
Fig. 5 is an overall framework diagram of a video human body unexpected motion positioning method based on a counterfactual sample according to an embodiment of the present application.
As shown in fig. 5, the overall framework of the counterfactual-sample-based human body unexpected action positioning method includes three modules: an encoder network that extracts visual features, a standard LSTM network that localizes the unexpected action, and a twin LSTM network that learns the causal relationship by comparing factual and counterfactual samples. The encoder network takes the video as input and outputs the spatiotemporal feature of each video frame; the original video is input into a many-to-many standard LSTM network to identify whether an unexpected action occurs in each video frame, i.e., the hidden-layer feature sequence is input into a three-way classifier to classify each sequence element as intentional, transitional, or accidental; the original video and the counterfactual sample are simultaneously input into the twin LSTM network, which consists of two many-to-many standard LSTM networks whose weights are shared, and which highlights the true causal effect of the intention by comparing the factual unexpected action with the counterfactual intentional action.
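The weight sharing between the two many-to-many LSTMs can be realized by applying one and the same LSTM module to both feature sequences, as in the PyTorch sketch below; the feature and hidden dimensions, the sigmoid scoring head, and the class name TwinLSTM are assumptions rather than the patent's configuration.

```python
# Illustrative sketch (PyTorch); dimensions and the scoring head are assumptions.
import torch
import torch.nn as nn

class TwinLSTM(nn.Module):
    """Two many-to-many LSTMs that share weights: the same nn.LSTM processes both inputs."""
    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)               # per-frame "unexpected action" score

    def forward(self, feat_original, feat_counterfactual):
        h_orig, _ = self.lstm(feat_original)                # factual branch
        h_cf, _ = self.lstm(feat_counterfactual)            # counterfactual branch (identical weights)
        p_orig = torch.sigmoid(self.head(h_orig)).squeeze(-1)
        p_cf = torch.sigmoid(self.head(h_cf)).squeeze(-1)
        ett = (p_orig - p_cf).mean()                        # participant effect of the intention cue
        return ett, p_orig, p_cf
```

Reusing a single nn.LSTM instance for both branches guarantees that the weights are shared exactly, which is one common way to implement a siamese (twin) network.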
Fig. 6 is a schematic structural diagram of a video human body accidental movement positioning device based on a counterfactual sample according to a second embodiment of the present application.
As shown in fig. 6, the video human accidental action positioning device based on counterfactual samples comprises: an acquisition module, a design module, a generation module, an encoder network, a standard LSTM network, a twin LSTM network, wherein,
the acquisition module 10 is used for acquiring a human body accidental action video as an original video;
the design module 20 is configured to design a causal model, where the causal model is a relationship between an input video, motion content, an unexpected motion, and model prediction, where the motion content is motion content in an original video, the unexpected motion is an unexpected motion included in the original video, and the model prediction is motion intention prediction of the input video;
a generation module 30 for generating a counterfactual sample by retrieving an intentional action part of an original video in a video pool;
the encoder network 40 is used for extracting the characteristics of the original video and the counterfactual samples to obtain a first space-time characteristic and a second space-time characteristic, wherein the first space-time characteristic is the space-time characteristic of each video frame in the original video, and the second space-time characteristic is the space-time characteristic of each video frame in the counterfactual samples;
the standard LSTM network 50 is used for identifying the action intention of each video frame in the original video according to the first spatiotemporal feature;
the twin LSTM network 60 is used for learning the causal effect of the unexpected motion of the original video on the model prediction by calculating the participant effect according to the first space-time feature and the second space-time feature, and further learning the real causal relationship of the unexpected motion of the original video on the model prediction, so that the accurate positioning of the human body unexpected motion in the original video is realized.
Further, in the embodiment of the present application, the standard LSTM network and the twin LSTM network are jointly optimized by a loss function, which is expressed as:
L = ETT + λ·L_CE = ETT + λ·E_X[log P(Y|X)]
wherein the cross-entropy loss function L_CE optimizes the standard LSTM network, the ETT loss term optimizes the twin LSTM network, the parameter λ balances the two loss terms, E_X denotes the expectation over all input videos X, and log P(Y|X) denotes the natural logarithm of the prediction probability.
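A sketch of the joint objective under these assumptions follows; the value of λ and the sign convention for the ETT term (negated here because the description states that the ETT of the intention is maximized) are assumptions rather than details fixed by the patent.

```python
# Illustrative sketch (PyTorch). lambda and the sign of the ETT term are assumptions:
# the ETT term is negated so that minimizing the loss maximizes the participant effect.
import torch.nn.functional as F

def joint_loss(ett, frame_logits, frame_labels, lam=0.1):
    # L_CE: frame-level cross-entropy for the standard LSTM branch (intentional/transitional/accidental).
    ce = F.cross_entropy(frame_logits.reshape(-1, frame_logits.size(-1)),
                         frame_labels.reshape(-1))
    # L = ETT + lambda * L_CE, with the ETT term driven upward during optimization.
    return -ett + lam * ce
```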
The video human body unexpected action positioning device based on counterfactual samples comprises an acquisition module, a design module, a generation module, an encoder network, a standard LSTM network and a twin LSTM network. The acquisition module is used for acquiring a human body unexpected action video as the original video; the design module is used for designing a causal model, wherein the causal model represents the relationship among the input video, the action content, the unexpected action and the model prediction, the action content is the action content in the original video, the unexpected action is the unexpected action contained in the original video, and the model prediction is the action intention prediction of the input video; the generation module is used for generating a counterfactual sample by retrieving the intentional action part of the original video in a video pool; the encoder network is used for extracting features of the original video and the counterfactual sample to obtain a first spatiotemporal feature and a second spatiotemporal feature, which are the spatiotemporal features of each video frame in the original video and in the counterfactual sample, respectively; the standard LSTM network is used for identifying the action intention of each video frame in the original video according to the first spatiotemporal feature; the twin LSTM network is used for learning, according to the first and second spatiotemporal features and by calculating the participant effect, the causal effect of the unexpected action of the original video on the model prediction and thereby the real causal relationship, so that the human body unexpected action in the original video is accurately localized. In this way, the technical problems that existing methods mainly study visual recognition of action appearance, cannot understand the potential intention behind an action, and cannot identify and localize human body unexpected actions in a video are solved: the counterfactual-sample-based unexpected action localization method decouples the action content and intention cues entangled in the video, alleviates the negative influence of biased action content, and calculates the causal effect of the intention cue on the model prediction, thereby accurately localizing the human body unexpected action in the video.
In order to implement the foregoing embodiments, the present application further proposes a non-transitory computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the method for locating a video human body unexpected motion based on counterfactual samples of the foregoing embodiments is implemented.
In the description of the present specification, reference to the description of "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" or the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (8)

1. A video human body unexpected motion positioning method based on counterfactual samples is characterized by comprising the following steps:
acquiring a human body accidental action video as an original video;
designing a causal model, wherein the causal model represents a relation among an input video, action content, an unexpected action and model prediction, the action content is the action content in the original video, the unexpected action is the unexpected action contained in the original video, and the model prediction is the action intention prediction of the input video;
generating a counterfactual sample by retrieving an intentional action portion of the original video in a video pool;
performing feature extraction on the original video and the counterfactual sample to obtain a first spatiotemporal feature and a second spatiotemporal feature, wherein the first spatiotemporal feature is the spatiotemporal feature of each video frame in the original video, and the second spatiotemporal feature is the spatiotemporal feature of each video frame in the counterfactual sample;
inputting the first spatiotemporal feature into a standard LSTM network to identify the action intention of each video frame in the original video, inputting the first spatiotemporal feature and the second spatiotemporal feature into a twin LSTM network, learning the causal effect of the unexpected action of the original video on the model prediction by calculating a participant effect, and further learning the true causal relationship of the unexpected action of the original video on the model prediction, so as to accurately localize the human body unexpected action in the original video, wherein the standard LSTM network and the twin LSTM network are many-to-many networks and share weights.
2. The method of claim 1, wherein the causal model represents a relationship between input video, motion content, unexpected motion, model predictions, as:
U←X→C
U→Y←C
U←C→Y
wherein X is the input video, C is the action content, U is the unexpected action, and Y represents the model prediction; U ← X → C represents that a segment of input video simultaneously contains the action content C and the intention cue U; U → Y ← C represents that both U and C have an influence on the final action intention prediction Y, where the influence of U is the true causal relationship and the influence of C is a spurious association caused by training bias; and U ← C → Y represents that the action content C has an influence on both the unexpected action U and the action intention prediction Y.
3. The method of claim 1, wherein the construction of the counterfactual sample comprises the steps of:
constructing an offline video pool to represent knowledge that an action is successfully executed, wherein the offline video pool is obtained by searching according to keywords of action contents in the original video;
retrieving a suitable video from the video pool by using an unsupervised retrieval method to construct a counterfactual sample, wherein the unsupervised retrieval method specifically includes that if the distance between the original video and the retrieved video in the video pool is greater than a preset threshold, the retrieved video is filtered, otherwise, the retrieved video is stored, and the counterfactual sample is represented as:
v_c = argmin_{v∈V} d(v, c), subject to d(v, c) ≤ τ
wherein the equation represents the process of retrieving, from the video pool V, a counterfactual video v_c having the same intentional action C = c, given a piece of original video consisting of an intentional action C = c followed by an unintentional action U = μ, where d(v, c) denotes the distance between a candidate video v and the intentional action c, and τ is the preset threshold;
and dividing the retrieved counterfactual video into the retrieved same intentional action and subsequent intentional development, and generating a final counterfactual sample by splicing the same intentional action and the subsequent intentional development.
4. The method according to claim 1, wherein the action intention of each video frame in the original video is identified by inputting the hidden-layer feature sequence into a three-way classifier to classify each sequence element as intentional, transitional, or accidental.
5. The method of claim 1, wherein the calculation of the participant effect is performed by subtracting a counterfactual prediction from an original prediction, wherein the original prediction is represented as:
Y_{U=μ} = P(Y | X(C, U))
wherein X(C, U) represents that the input video X contains the action content C and the intention cue U, Y represents the model prediction, Y_{U=μ} represents the original prediction, and U = μ represents the observed evidence that the unexpected action actually occurred;
the counterfactual prediction is expressed as:
Y_{U=μ_c} = P(Y | X(C, U=μ_c))
wherein U = μ_c represents the counterfactual sample, Y represents the model prediction, Y_{U=μ_c} represents the counterfactual prediction, X represents the input video, and C represents the action content;
the participant effect is represented as:
ETT = E_{U=μ}[Y_{U=μ} - Y_{U=μ_c}]
where E denotes the expectation, U denotes the intention cue, U = μ denotes the observed evidence that the unexpected action actually occurred, Y_{U=μ} represents the original prediction, and Y_{U=μ_c} represents the counterfactual prediction.
6. A video human body accidental action positioning device based on counterfactual samples is characterized by comprising an acquisition module, a design module, a generation module, an encoder network, a standard LSTM network and a twin LSTM network, wherein,
the acquisition module is used for acquiring a human body accidental action video as an original video;
the design module is used for designing a causal model, wherein the causal model is a relation among an input video, action content, an unexpected action and model prediction, the action content is the action content in the original video, the unexpected action is the unexpected action contained in the original video, and the model prediction is the action intention prediction of the input video;
the generation module is used for generating a counterfactual sample by retrieving the intentional action part of the original video in a video pool;
the encoder network performs feature extraction on the original video and the counterfactual sample to obtain a first spatiotemporal feature and a second spatiotemporal feature, wherein the first spatiotemporal feature is a spatiotemporal feature of each video frame in the original video, and the second spatiotemporal feature is a spatiotemporal feature of each video frame in the counterfactual sample;
the standard LSTM network is used for identifying the action intention of each video frame in the original video according to the first spatiotemporal feature;
the twin LSTM network is used for learning the causal effect of the unexpected motion of the original video on model prediction by calculating a participant effect according to the first space-time feature and the second space-time feature, and further learning the real causal relationship of the unexpected motion of the original video on the model prediction, so that the accurate positioning of the human body unexpected motion in the original video is realized.
7. The apparatus of claim 6, wherein the standard LSTM network and the twin LSTM network are jointly optimized by a loss function represented as:
L = ETT + λ·L_CE = ETT + λ·E_X[log P(Y|X)]
wherein the cross-entropy loss function L_CE optimizes the standard LSTM network, the ETT loss term optimizes the twin LSTM network, the parameter λ balances the two loss terms, E_X denotes the expectation over all input videos X, and log P(Y|X) denotes the natural logarithm of the prediction probability.
8. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the method of any one of claims 1-5.
CN202110931899.9A 2021-08-13 2021-08-13 Video human body accidental action positioning method and device based on counterfactual sample Active CN113743244B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110931899.9A CN113743244B (en) 2021-08-13 2021-08-13 Video human body accidental action positioning method and device based on counterfactual sample

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110931899.9A CN113743244B (en) 2021-08-13 2021-08-13 Video human body accidental action positioning method and device based on counterfactual sample

Publications (2)

Publication Number Publication Date
CN113743244A CN113743244A (en) 2021-12-03
CN113743244B true CN113743244B (en) 2022-10-18

Family

ID=78731068

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110931899.9A Active CN113743244B (en) 2021-08-13 2021-08-13 Video human body accidental action positioning method and device based on counterfactual sample

Country Status (1)

Country Link
CN (1) CN113743244B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117558067A (en) * 2023-12-28 2024-02-13 天津大学 Action prediction method based on action recognition and sequence reasoning

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112507990A (en) * 2021-02-04 2021-03-16 北京明略软件系统有限公司 Video time-space feature learning and extracting method, device, equipment and storage medium
CN112906609A (en) * 2021-03-05 2021-06-04 清华大学 Video important area prediction method and device based on two-way cross attention network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112507990A (en) * 2021-02-04 2021-03-16 北京明略软件系统有限公司 Video time-space feature learning and extracting method, device, equipment and storage medium
CN112906609A (en) * 2021-03-05 2021-06-04 清华大学 Video important area prediction method and device based on two-way cross attention network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Human action recognition method based on key-frame two-stream convolutional networks; Zhang Congcong et al.; Journal of Nanjing University of Information Science & Technology (Natural Science Edition); 2019-11-28 (Issue 06); full text *

Also Published As

Publication number Publication date
CN113743244A (en) 2021-12-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant