CN108234821B - Method, device and system for detecting motion in video


Info

Publication number
CN108234821B
Authority
CN
China
Prior art keywords
action
segments
motion
time domain
segment
Prior art date
Legal status
Active
Application number
CN201710131577.XA
Other languages
Chinese (zh)
Other versions
CN108234821A (en)
Inventor
熊元骏
赵岳
王利民
林达华
汤晓鸥
Current Assignee
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Priority to CN201710131577.XA
Publication of CN108234821A
Application granted
Publication of CN108234821B

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/14 Picture signal circuitry for video frequency region
    • H04N5/144 Movement detection
    • H04N5/145 Movement estimation

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a method, an apparatus and a system for detecting actions in a video. The method for detecting actions in a video includes the following steps: obtaining action degree estimates respectively corresponding to a plurality of segments in the video; generating a time-series action degree sequence of the video according to the action degree estimates respectively corresponding to the plurality of segments; and aggregating, based on the time-series action degree sequence, to obtain an action prediction time domain interval in the video.

Description

Method, device and system for detecting motion in video
Technical Field
The present application relates to the field of Computer Vision (CV), and in particular, to a method, apparatus and system for detecting motion in video.
Background
Understanding human behavior and actions is an important task of computer vision systems. Current action recognition technology in computer vision systems takes video as input data, but it can only process video segments that have been clipped so that they contain only action-related content. In the process of action recognition, such systems mainly use deep learning methods to analyze dynamically changing video content and to overcome the influence of factors such as distance, viewpoint change, camera movement and scene change.
Action recognition technology can recognize the action categories in clipped video by using deep learning methods to fuse video content such as shape information, motion features and long-range temporal relationships. However, when action recognition technology is applied in practice, the video that the system needs to process is often the original, unclipped video. This requires that the action recognition system no longer be limited to recognizing only the category of an action, but also be able to detect the start and end times of each action instance in the unclipped raw video. This task may be referred to as temporal action detection, i.e., detecting in a video the start and end times of all action instances and the action category to which each belongs. Temporal action detection technology has great application value in fields such as security monitoring, online video analysis, live video analysis and autonomous driving.
At present, research on temporal action detection is still insufficient; in particular, a satisfactory method for accurately detecting the start time and end time of all action instances is still lacking.
Disclosure of Invention
The application provides a method, a device and a system for detecting actions in videos.
According to one aspect of the application, a method of detecting motion in a video includes: obtaining action degree estimated values respectively corresponding to a plurality of segments in a video; generating a time sequence action degree sequence of the video according to the action degree estimated values respectively corresponding to the plurality of segments; and aggregating to obtain the motion prediction time domain interval in the video based on the time sequence motion degree sequence.
According to one embodiment, the obtaining of the motion estimation values corresponding to the plurality of segments in the video comprises: calculating a first action degree estimation value corresponding to each of a plurality of segments in the video by using a first convolutional neural network, wherein the first action degree estimation value is obtained from any one frame of image in the segments; calculating a second action degree estimation value corresponding to each segment by using a second convolutional neural network, wherein the second action degree estimation value is obtained from optical flow field images of the segments, and the optical flow field images are formed by combining optical flow fields extracted from any multi-frame images in the segments; and obtaining the action degree estimated value corresponding to each segment according to the first action degree estimated value and the second action degree estimated value corresponding to each segment.
According to an embodiment, the obtaining the motion estimation value corresponding to each segment according to the first motion estimation value and the second motion estimation value corresponding to each segment includes: and adding the first action degree estimation value and the second action degree estimation value corresponding to each segment to obtain the action degree estimation value corresponding to each segment.
According to an embodiment, the generating the time-series motion estimation sequence of the video according to the motion estimation values respectively corresponding to the plurality of segments includes: and normalizing the action degree estimated values respectively corresponding to the plurality of segments, and generating a time sequence action degree sequence of the video according to each action degree estimated value after the normalization processing.
According to an embodiment, the aggregating motion prediction temporal intervals in the video based on the time-series motion degree sequence includes: obtaining each foreground segment in the time sequence action degree sequence, wherein the action degree estimated value corresponding to the foreground segment is greater than or equal to an action degree threshold value; aggregating a plurality of the foreground segments that are adjacent to each other into the motion prediction temporal interval based on a tolerance threshold.
According to one embodiment, the tolerance threshold is used to characterize the proportion of the foreground segment in the motion prediction temporal interval.
According to one embodiment, the method further comprises: and determining the action type corresponding to the action prediction time domain interval according to the segments contained in the action prediction time domain interval.
According to an embodiment, the determining, according to the segments included in the motion prediction time domain interval, the motion category corresponding to the motion prediction time domain interval includes: respectively obtaining the scores of all the segments in the action prediction time domain interval based on each action category, and combining the scores of all the segments to obtain the total score of the current action category; and determining the target action category of the action prediction time domain interval according to the total score of each action category.
According to one embodiment, the determining the target action category of the action prediction time domain interval according to the total score of each action category includes: and determining the action category with the highest total score as a target action category of the action prediction time domain interval.
According to one embodiment, the method further comprises: and determining the action integrity of the action prediction time domain interval according to each segment contained in the action prediction time domain interval and the adjacent segment of the action prediction time domain interval.
According to an embodiment, the determining the motion integrity of the motion prediction time domain interval according to each segment included in the motion prediction time domain interval and the segment adjacent to the motion prediction time domain interval includes: obtaining scores of adjacent segments of the action prediction time domain interval based on the target action category; and determining the action integrity of the action prediction time domain interval according to each segment in the action prediction time domain interval and the score of each adjacent segment.
According to one embodiment, the motion prediction temporal interval includes a motion prediction start time and a motion prediction end time.
According to another aspect of the present application, an apparatus for detecting motion in a video includes: the acquisition module is used for acquiring action degree estimated values respectively corresponding to a plurality of segments in the video; the generating module is used for generating a time sequence action degree sequence of the video according to the action degree estimated values respectively corresponding to the plurality of segments; and the aggregation module is used for aggregating to obtain the action prediction time domain interval in the video based on the time sequence action degree sequence.
According to one embodiment, the acquisition module comprises: the first convolutional neural network is used for calculating a first action degree estimation value corresponding to each of a plurality of segments in the video, and the first action degree estimation value is obtained from any one frame of image in the segments; the second convolutional neural network is used for calculating a second action degree estimated value corresponding to each segment, the second action degree estimated value is obtained from the optical flow field images of the segments, and the optical flow field images are formed by combining optical flow fields extracted from any multi-frame images in the segments; and the first obtaining submodule is used for obtaining the action degree estimated value corresponding to each segment according to the first action degree estimated value and the second action degree estimated value corresponding to each segment.
According to an embodiment, the first obtaining sub-module adds the first motion estimation value and the second motion estimation value corresponding to each segment to obtain the motion estimation value corresponding to each segment.
According to one embodiment, the generating module normalizes the motion estimation values corresponding to the plurality of segments, and generates the time sequence motion sequence of the video according to each motion estimation value after the normalization.
According to one embodiment, the aggregation module comprises: the second obtaining submodule is used for obtaining each foreground segment in the time sequence action degree sequence, and the action degree estimated value corresponding to the foreground segment is larger than or equal to an action degree threshold value; and the aggregation sub-module is used for aggregating a plurality of adjacent foreground segments into the action prediction time domain interval based on a tolerance threshold value.
According to one embodiment, the tolerance threshold is used to characterize the proportion of the foreground segment in the motion prediction temporal interval.
According to one embodiment, the apparatus further comprises: and the action type determining module is used for determining the action type corresponding to the action prediction time domain interval according to the fragments contained in the action prediction time domain interval.
According to one embodiment, the action category determination module comprises: the score calculation submodule is used for acquiring the scores of all the segments in the action prediction time domain interval based on each action category and combining the scores of all the segments to obtain the total score of the current action category; and the first determining submodule is used for determining the target action category of the action prediction time domain interval according to the total score of each action category.
According to one embodiment, the first determination submodule determines the action category with the highest total score as the target action category of the action prediction time domain interval.
According to one embodiment the apparatus further comprises: and the action integrity determining module is used for determining the action integrity of the action prediction time domain interval according to each segment contained in the action prediction time domain interval and the adjacent segment of the action prediction time domain interval.
According to one embodiment, the action integrity determination module comprises: a third obtaining sub-module, configured to obtain, based on the target action category, a score of each adjacent segment of the action prediction time domain interval; and a second determining submodule, configured to determine an action integrity of the action prediction time domain interval according to each segment in the action prediction time domain interval and the score of each adjacent segment.
According to one embodiment, the motion prediction temporal interval includes a motion prediction start time and a motion prediction end time.
According to yet another aspect of the present application, a system for detecting motion in a video includes: a memory for storing executable instructions; and a processor in communication with the memory to execute the executable instructions to: obtaining action degree estimated values respectively corresponding to a plurality of segments in a video; generating a time sequence action degree sequence of the video according to the action degree estimated values respectively corresponding to the plurality of segments; and aggregating to obtain the motion prediction time domain interval in the video based on the time sequence motion degree sequence.
Therefore, a time sequence action degree sequence of the video is generated according to the action degree estimation values respectively corresponding to the segments in the video, and then the time sequence action degree sequence is aggregated, so that the occurrence time (such as the starting time and the ending time) of one or more action instances in the video can be obtained. Thus, the occurrence time of each action instance can be detected in the un-clipped original video, enabling temporal action detection.
Drawings
FIG. 1 shows a flow diagram of a method of detecting motion in a video according to one embodiment of the present application.
Fig. 2 is a flowchart illustrating obtaining motion estimation values corresponding to a plurality of segments in a video according to an embodiment of the present application.
Fig. 3 shows a flowchart for obtaining motion prediction temporal intervals in a video based on time-series motion degree sequence aggregation according to an embodiment of the present application.
Fig. 4A and 4B respectively show the aggregation results of the same time sequence activity degree sequence under different tolerance thresholds.
Fig. 5 shows a flow diagram of a method of detecting motion in a video according to another embodiment.
Fig. 6 shows a flowchart for determining an action category corresponding to an action prediction time domain interval according to a segment contained in the action prediction time domain interval according to an embodiment of the present application.
Fig. 7 shows a flow diagram of a method of detecting motion in a video according to another embodiment.
Fig. 8 shows a flowchart for determining the motion integrity of the motion prediction time domain interval according to each segment contained in the motion prediction time domain interval and the adjacent segments of the motion prediction time domain interval according to an embodiment of the present application.
FIG. 9 shows a block diagram of an apparatus for detecting motion in a video according to an embodiment of the present application.
FIG. 10 shows a block diagram of an acquisition module according to one embodiment of the present application.
FIG. 11 shows a block diagram of an aggregation module, according to an embodiment of the present application.
Fig. 12 shows a block diagram of an apparatus for detecting motion in a video according to another embodiment of the present application.
FIG. 13 illustrates a block diagram of an action category determination module according to one embodiment of the present application.
Fig. 14 shows a block diagram of an apparatus for detecting motion in a video according to another embodiment of the present application.
FIG. 15 shows a block diagram of an action integrity determination module according to one embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below with reference to the accompanying drawings. It should be noted that the following description is merely exemplary in nature and is not intended to limit the present application. Further, in the following description, the same reference numbers will be used to refer to the same or like parts in different drawings. The different features in the different embodiments described below can be combined with each other to form further embodiments within the scope of the application.
In the description of the present application, a "convolutional neural network (CNN)" refers to a convolutional neural network that has been trained (for example, it may be trained using a temporal segment network framework) and that is capable of classifying an input image or video segment, i.e., giving the input image or video segment a score with respect to each preset action category, or giving an action degree estimate of the video segment.
In the description of the present application, a "segment" in a video refers to a portion of the video that may include multiple frames of consecutive images in the video.
In the description of the present application, the "action degree" characterizes how likely it is that an action instance occurs in a certain image or video segment.
FIG. 1 shows a flow diagram of a method of detecting motion in a video according to one embodiment of the present application. As shown in fig. 1, the method 1000 includes steps S1100 to S1300.
In step S1100, motion estimation values corresponding to a plurality of segments in a video are obtained.
According to one embodiment, a person skilled in the art may select a convolutional neural network for training according to actual needs, and then use the trained convolutional neural network in step S1100 to obtain an action degree estimate for each video segment.
In step S1200, a time-series motion level sequence of the video is generated based on the motion level estimates corresponding to the plurality of segments.
The action degree estimates of the video segments obtained in step S1100 may form a sequence in time order. For example, if the video contains N segments and the action degree estimates calculated by the convolutional neural network are A1 to AN, respectively, then the time-series action degree sequence of the video may be represented as A1, A2, A3, ..., AN.
Subsequently, in step S1300, based on the time-series motion degree sequence, motion prediction time domain intervals in the video are aggregated.
Also taking the above video containing N segments as an example, the time-series action degree sequence A1, A2, A3, ..., AN can be aggregated into one or more action instances by a time-series action degree aggregation algorithm (which will be described in detail below). An action instance may be represented as, for example, A1A2 or A5A6A7, where A1A2 indicates that the first and second segments of the video form one action instance, and A5A6A7 indicates that the fifth, sixth and seventh segments form another action instance. Because A1, A2, A3, ..., AN is a time-ordered sequence, each element in the sequence has a corresponding time coordinate, so the occurrence interval of each action instance, i.e., its prediction time domain interval, can be found; this interval indicates the position in the time domain of a series of video segments in which an action instance may exist.
Therefore, a time sequence action degree sequence of the video is generated according to the action degree estimation values respectively corresponding to the segments in the video, and then the time sequence action degree sequence is aggregated, so that the occurrence time (such as the starting time and the ending time) of one or more action instances in the video can be obtained. According to the embodiment of the application, the starting time and the ending time of each action instance can be detected in the original video which is not clipped, and the time domain action detection is realized.
According to one embodiment, the motion prediction time domain interval may include a motion prediction start time and a motion prediction end time.
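For illustration only (not part of the patent), the following Python sketch shows how aggregated groups of segment indices could be mapped back onto the time axis once the segment duration is known; the value of segment_duration and the example groups are assumptions introduced for this sketch.

```python
# Minimal sketch: map aggregated segment-index groups to temporal intervals.
# Assumes every segment covers an equal, known duration (hypothetical value).

segment_duration = 2.0  # seconds per segment; an assumed example value

# Hypothetical aggregation result: each tuple holds the 0-based indices of the
# consecutive segments grouped into one action instance, e.g. (0, 1) plays the
# role of A1A2 and (4, 5, 6) the role of A5A6A7 in the text above.
groups = [(0, 1), (4, 5, 6)]

def group_to_interval(group, seg_len):
    """Return (start_time, end_time) in seconds for a group of segment indices."""
    start = group[0] * seg_len
    end = (group[-1] + 1) * seg_len
    return start, end

for g in groups:
    print(group_to_interval(g, segment_duration))
# prints (0.0, 4.0) and (8.0, 14.0): the predicted start and end times
```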
Fig. 2 is a flowchart illustrating obtaining motion estimation values corresponding to a plurality of segments in a video according to an embodiment of the present application. As shown in fig. 2, the step S1100 includes sub-steps S1110 to S1130.
In sub-step S1110, a first motion estimation value corresponding to each of a plurality of segments in the video is calculated by using a first convolutional neural network.
Wherein the first motion estimation is obtained from any frame of image in the segment. Those skilled in the art can select the first convolutional neural network for training according to actual needs, and then use the trained first convolutional neural network in step S1110 to calculate the motion estimation of one frame of image in each video segment. In practice, one frame of image from the video segment can be arbitrarily selected and input into the first convolutional neural network to calculate the motion estimation. For example, the selected frame of image may be the first frame of image in a video clip.
In sub-step S1120, a second motion degree estimate corresponding to each segment is calculated using a second convolutional neural network.
The second motion estimate is derived from the optical flow field images of the segments, which are merged by the optical flow field extracted from any of the plurality of frames of images in the segments. The "optical flow field image" of a segment described herein refers to an image formed by merging optical flow fields extracted from a plurality of frames of images in the segment. Those skilled in the art can select the second convolutional neural network for training according to actual needs, and then use the trained second convolutional neural network in step S1120 to calculate the motion estimation value of the optical flow field image of each video segment. In practical operation, multiple frames of images (for example, five frames of images after the one frame of image selected in step S1110) may be arbitrarily selected from the video segment and optical flows thereof are extracted to obtain multiple frames of optical flow field images, and the optical flow field images of these frames are merged to obtain the optical flow field image of the video segment.
An optical flow field is a two-dimensional vector field that can be rescaled and represented as two single-channel images, one representing the magnitude of the horizontal velocity component and the other the magnitude of the vertical velocity component. For example, the optical flow field values of the selected five frames may each be linearly mapped into the interval from 0 to 255 around a mean value of 128, and the resulting optical flow images of the five frames are then combined, as individual channels, into a ten-channel image that is used as the input of the second convolutional neural network.
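As an illustration of how such an optical flow field image can be assembled, the following NumPy sketch follows the linear mapping around a mean value of 128 described above; the scaling factor, image size and function names are assumptions introduced only for this sketch.

```python
import numpy as np

def flow_to_channels(flow, scale=16.0):
    """Map one optical flow field of shape (H, W, 2) to two single-channel images.

    The horizontal and vertical velocity components are linearly mapped into
    [0, 255] around a mean value of 128, as described above; the choice of
    `scale` (how much motion maps to the full range) is an assumption.
    """
    mapped = np.clip(flow * scale + 128.0, 0, 255).astype(np.uint8)
    return mapped[..., 0], mapped[..., 1]

def build_flow_field_image(flows):
    """Merge the optical flow fields of several frames (e.g. five) channel-wise.

    `flows` is a list of (H, W, 2) arrays; five flow fields yield a ten-channel
    image that can serve as the input of the second convolutional neural network.
    """
    channels = []
    for flow in flows:
        fx, fy = flow_to_channels(flow)
        channels.extend([fx, fy])
    return np.stack(channels, axis=-1)  # shape (H, W, 2 * len(flows))

# Example with random flow fields for five frames of a 224x224 segment.
flows = [np.random.randn(224, 224, 2).astype(np.float32) for _ in range(5)]
print(build_flow_field_image(flows).shape)  # (224, 224, 10)
```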
In sub-step S1130, the motion degree estimate corresponding to each segment is obtained according to the first motion degree estimate and the second motion degree estimate corresponding to each segment.
For example, the output result of the first convolutional neural network in sub-step S1110 (i.e., the first motion estimation) and the output result of the second convolutional neural network in sub-step S1120 (i.e., the second motion estimation) may be added to obtain a value, which may be used as the motion estimation of the video segment. The "addition" described herein may be a simple superposition of two values, or may be a weighted addition, and those skilled in the art can perform weighted summation calculation according to actual needs.
Therefore, by using different convolutional neural networks, an action degree estimate can be obtained for one frame of the video segment and for the optical flow field image synthesized from multiple frames, and these two estimates are then added to obtain the action degree estimate of the video segment.
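A minimal sketch of this two-stream fusion is given below; `rgb_actionness_net` and `flow_actionness_net` are hypothetical placeholders standing in for the trained first and second convolutional neural networks, and the default weights reproduce the simple (unweighted) addition while other weights give the weighted variant.

```python
# Sketch of sub-steps S1110-S1130: fuse the two per-segment action degree estimates.
# The two network arguments are placeholders for trained models, not real APIs.

def segment_actionness(frame, flow_image,
                       rgb_actionness_net, flow_actionness_net,
                       w_rgb=1.0, w_flow=1.0):
    """Combine the RGB-frame and optical-flow action degree estimates of one segment.

    The weight values are assumptions; w_rgb = w_flow = 1.0 corresponds to the
    simple addition mentioned above, other values to a weighted addition.
    """
    a_rgb = rgb_actionness_net(frame)          # first action degree estimate
    a_flow = flow_actionness_net(flow_image)   # second action degree estimate
    return w_rgb * a_rgb + w_flow * a_flow
```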
According to an embodiment of the present application, the step S1200 may include: normalizing the action degree estimates respectively corresponding to the plurality of segments, and generating the time-series action degree sequence of the video from the normalized action degree estimates. In this case, the action degree estimates of the plurality of segments can be mapped into the range [0, 1] by a preset threshold function, giving a normalized time-series action degree sequence. For example, the action degree estimate of each video segment can be mapped into the range [0, 1] using a Sigmoid function or a Tanh function. Normalizing each value in the time-series action degree sequence facilitates the subsequent processing.
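A minimal NumPy sketch of this normalization step, assuming a Sigmoid mapping (a Tanh-based mapping such as 0.5 * (tanh(x) + 1) would be analogous), is shown below.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def normalize_actionness(raw_estimates):
    """Map raw per-segment action degree estimates into [0, 1] with a Sigmoid,
    yielding the normalized time-series action degree sequence described above."""
    return sigmoid(np.asarray(raw_estimates, dtype=np.float64))

print(normalize_actionness([-2.0, 0.0, 3.5]))  # approx. [0.119, 0.5, 0.971]
```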
Fig. 3 shows a flowchart for obtaining motion prediction temporal intervals in a video based on time-series motion degree sequence aggregation according to an embodiment of the present application. As shown in fig. 3, the above step S1300 includes sub-steps S1310 and S1320.
In sub-step S1310, each foreground segment in the sequence of temporal motion degrees is acquired.
The motion estimation corresponding to the foreground segment may be greater than or equal to the motion threshold. Here, a foreground segment refers to a video segment in which there is a large possibility of an action instance due to a high action degree estimate (greater than or equal to a preset action threshold). The preset action threshold according to one embodiment may be determined according to actual needs and actual conditions. For example, when each motion metric estimate in the time series motion metric sequence has been normalized, the motion threshold may be preset to 0.4, 0.5, or 0.6, and so on.
Then, in sub-step S1320, based on the tolerance threshold, the adjacent foreground segments are aggregated into the motion prediction temporal interval.
According to one embodiment, in a temporal action aggregation (Temporal Action Grouping) algorithm, the tolerance threshold is used to characterize the proportion of foreground segments in an action prediction temporal interval. For example, fig. 4A and 4B respectively show the aggregation results of the same time-series action degree sequence under different tolerance thresholds. In the time-series action degree sequences shown in fig. 4A and 4B, the number "1" represents a foreground segment and the number "0" represents a non-foreground segment (which may also be referred to as a background segment). Thus, a plurality of action prediction temporal intervals can be obtained by selecting different action degree thresholds and different tolerance thresholds.
As shown in fig. 4A, when the preset tolerance threshold is small (τ = 0.6), two action prediction temporal intervals are obtained by aggregation. On the other hand, as shown in fig. 4B, when the preset tolerance threshold is large (τ = 0.8), only one action prediction temporal interval is obtained by aggregation. The detection accuracy of actions in the video can therefore be controlled by controlling the value of the tolerance threshold: the larger the tolerance threshold, the more segments the aggregated prediction temporal intervals cover, but these intervals may contain more segments in which no action instance exists; the smaller the tolerance threshold, the fewer segments the aggregated prediction temporal intervals cover, and the fewer segments without action instances they contain.
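The sketch below shows one plausible realization of this grouping step. The text does not spell out the exact grouping rule, so the way the tolerance parameter enters the loop, and whether it corresponds directly to the patent's tolerance threshold or to its complement, are assumptions; only the general idea of absorbing adjacent segments while limiting the share of non-foreground segments is taken from the description above.

```python
import numpy as np

def group_foreground(actionness, action_thr=0.5, tolerance=0.3):
    """Group foreground segments into candidate action prediction intervals.

    actionness : 1-D sequence of normalized per-segment action degree estimates.
    action_thr : segments with actionness >= action_thr count as foreground.
    tolerance  : in this sketch, the largest fraction of non-foreground segments
                 allowed inside a growing interval (so the foreground proportion
                 stays >= 1 - tolerance). This reading is an assumption.

    Returns a list of (start_idx, end_idx) segment-index pairs, end inclusive.
    """
    fg = np.asarray(actionness) >= action_thr
    n = len(fg)
    groups = []
    i = 0
    while i < n:
        if not fg[i]:
            i += 1
            continue
        start = end = i
        fg_count = 1
        j = i + 1
        while j < n:
            cand_fg = fg_count + int(fg[j])
            cand_len = j - start + 1
            if (cand_len - cand_fg) / cand_len > tolerance:
                break                  # too much background: stop growing
            fg_count = cand_fg
            if fg[j]:
                end = j                # intervals always end on a foreground segment
            j += 1
        groups.append((start, end))
        i = end + 1
    return groups

# Toy sequence in the spirit of the '1'/'0' notation of figs. 4A and 4B.
seq = [0.9, 0.8, 0.2, 0.1, 0.85, 0.9, 0.7, 0.1]
print(group_foreground(seq, tolerance=0.2))  # [(0, 1), (4, 6)]  two intervals
print(group_foreground(seq, tolerance=0.5))  # [(0, 6)]          one merged interval
```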
Fig. 5 shows a flow diagram of a method of detecting motion in a video according to another embodiment. As shown in fig. 5, the method 1000' includes a step S1400 in addition to the steps S1100 to S1300. For the sake of brevity, only the differences of the embodiment shown in fig. 5 from fig. 1 will be described below, and detailed descriptions of the same parts will be omitted.
In step S1400, an action type corresponding to the action prediction time domain section is determined from the segments included in the action prediction time domain section.
Therefore, not only the occurrence time (i.e. the starting time and the ending time) of each action instance in the video can be obtained, but also the action type of each action instance can be detected and used for carrying out corresponding processing on segments of different action types.
Fig. 6 shows a flowchart for determining an action category corresponding to an action prediction time domain interval according to a segment contained in the action prediction time domain interval according to an embodiment of the present application. As shown in fig. 6, the above step S1400 includes sub-steps S1410 and S1420.
In sub-step S1410, based on each motion category, the score of each segment in the motion prediction time domain interval is obtained, and the scores of the segments are combined to obtain the total score of the current motion category.
According to one embodiment, a preset motion classifier may be utilized to obtain the score of each segment in the motion prediction time domain interval relative to each motion category. According to one embodiment, the motion classifier may be a convolutional neural network. Those skilled in the art can select an appropriate convolutional neural network for training according to actual needs, and then use the trained convolutional neural network in sub-step S1410 as a motion classifier. The action classifier is used for processing the input video segment to output the score of the video segment relative to each preset action category in the action classifier.
Generally, the prediction temporal interval of one action instance contains a plurality of video segments, and the score of the prediction temporal interval for an action category can be obtained by combining (for example, simply summing) the scores of these video segments for that same action category.
Subsequently, in sub-step S1420, a target action category of the action prediction time domain interval is determined according to the total score of each action category.
After the total score of the predicted time-domain section for each action category is obtained, the action category with the highest total score may be determined as the target action category of the predicted time-domain section.
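As an illustration of sub-steps S1410 and S1420, the sketch below sums the per-segment scores of each action category over the segments of an interval and picks the category with the highest total. The score matrix of shape (m categories x n segments) anticipates the matrix S described later in the text; the random toy data and function name are assumptions.

```python
import numpy as np

def classify_interval(score_matrix, interval):
    """Determine the target action category of one action prediction interval.

    score_matrix : (m, n) array; entry [i, j] is the score of segment j with
                   respect to action category i.
    interval     : (start_idx, end_idx) segment indices, end inclusive.

    The per-category total is the simple sum of the segment scores, matching the
    'combining (for example, simply summing)' description; a weighted combination
    would also fit the text.
    """
    start, end = interval
    totals = score_matrix[:, start:end + 1].sum(axis=1)  # total score per category
    target_category = int(np.argmax(totals))             # category with highest total
    return target_category, totals

# Toy example: 3 categories, 8 segments, an interval covering segments 4..6.
rng = np.random.default_rng(0)
scores = rng.random((3, 8))
print(classify_interval(scores, (4, 6)))
```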
Fig. 7 shows a flow diagram of a method of detecting motion in a video according to another embodiment. As shown in fig. 7, the method 1000 ″ includes step S1500 in addition to steps S1100 to S1400. For the sake of brevity, only the differences of the embodiment shown in fig. 7 from fig. 5 will be described below, and detailed descriptions of the same parts will be omitted.
In step S1500, the motion integrity of the motion prediction time domain section is determined based on each segment included in the motion prediction time domain section and the segment adjacent to the motion prediction time domain section.
Thus, not only the occurrence time (i.e., start time and end time) and the action category of each action instance in the video may be obtained, but it may also be determined whether each action instance is complete. In this way, incomplete action instances can be removed from the results, so that each remaining action instance has more accurate boundary segments.
Fig. 8 shows a flowchart for determining the motion integrity of the motion prediction time domain interval according to each segment contained in the motion prediction time domain interval and the adjacent segments of the motion prediction time domain interval according to an embodiment of the present application. As shown in fig. 8, the above step S1500 includes sub-steps S1510 and S1520.
In sub-step S1510, based on the target motion category, a score of each neighboring segment of the motion prediction temporal interval is obtained.
The number of segments adjacent to the prediction temporal interval can be selected by those skilled in the art according to actual needs, for example, a plurality of video segments before and/or after the prediction temporal interval (e.g., the total length is 20% of the prediction temporal interval) can be selected.
In sub-step S1520, the motion integrity of the motion prediction time domain section is determined according to the scores of each segment and each adjacent segment in the motion prediction time domain section.
According to one embodiment, the above sub-step S1520 may be accomplished in the following manner.
First, an input feature is generated from scores for each action category of a segment included in a prediction time domain section and a segment adjacent to the prediction time domain section. The input features constructed in the present application will be described in detail below.
Then, the generated input features are input into a completeness classifier corresponding to the action category of the prediction time domain interval to obtain an action completeness score of the prediction time domain interval. According to one embodiment, each action category may correspond to a preset integrity classifier to determine the integrity of the action category, and the integrity classifier may be a Support Vector Machine (SVM). Those skilled in the art can select an appropriate SVM for training according to actual needs (for example, training can be performed by using a difficult sample mining framework), and then use the trained SVM as an integrity classifier for obtaining an action integrity score of the prediction time domain interval. The integrity classifier is operative to process the input features to give corresponding action integrity scores.
Therefore, the video segments in the prediction time domain interval are utilized, and the video segments adjacent to the prediction time domain interval are utilized to judge the integrity of the action example in the interval, so that the edge of the action example can be detected more accurately, and a more accurate result can be obtained.
According to an embodiment of the present application, the constructed input features may include a score of all segments contained in the prediction time domain interval with respect to each action category. Further, the constructed input features may also include a score for a portion of the segments contained in the predicted time-domain interval relative to each action category. Still further, the constructed input features may also include scores for segments adjacent to the prediction temporal interval relative to each action category.
The input features constructed in the present application will be specifically described below by one embodiment. First, assume that the video contains n video segments and that m action categories are preset, and let the score of the j-th segment (1 ≤ j ≤ n) of the video with respect to the i-th action category (1 ≤ i ≤ m) be denoted Sij. The scores of all n segments with respect to all m action categories can then be represented by the following m×n matrix:

S = [ S11  S12  ...  S1n
      S21  S22  ...  S2n
      ...
      Sm1  Sm2  ...  Smn ]
In the matrix representation, each row element represents the score of all the segments in the video relative to one action category, and each column element represents the score of one segment in the video relative to all the action categories.
After the action prediction temporal interval is obtained (sub-step S1320), the interval corresponds to a number of columns of this matrix. According to this embodiment, the constructed input features may include the elements of all columns corresponding to the prediction temporal interval. Further, the constructed input features may include the elements of a part of the columns corresponding to the prediction temporal interval. Still further, the constructed input features may also include the elements of the columns adjacent to the prediction temporal interval.
Thus, the scores of the segments in or adjacent to the prediction temporal interval with respect to different action categories may be utilized to construct the input features, thereby obtaining more accurate results.
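The sketch below illustrates one possible way, under assumptions, to assemble such an input feature and pass it to a per-category completeness classifier. Averaging the matrix columns inside and adjacent to the interval, rather than concatenating them all, is an assumption made so that the feature length does not depend on the interval length; the use of scikit-learn's LinearSVC is only one possible choice of SVM, and `trained_completeness_svms` is a hypothetical container of classifiers trained beforehand.

```python
import numpy as np
from sklearn.svm import LinearSVC  # one possible completeness classifier (an SVM)

def completeness_features(score_matrix, interval, context_frac=0.2):
    """Build an input feature vector for the per-category completeness classifier.

    The feature concatenates (i) the column-wise average of the score matrix over
    the segments inside the prediction interval and (ii) the averages over a few
    segments immediately before and after it (about 20% of the interval length on
    each side, the example proportion mentioned above).
    """
    m, n = score_matrix.shape
    start, end = interval
    ctx = max(1, int(round((end - start + 1) * context_frac)))
    inside = score_matrix[:, start:end + 1].mean(axis=1)
    before = score_matrix[:, max(0, start - ctx):start]
    after = score_matrix[:, end + 1:min(n, end + 1 + ctx)]
    before = before.mean(axis=1) if before.size else np.zeros(m)
    after = after.mean(axis=1) if after.size else np.zeros(m)
    return np.concatenate([before, inside, after])  # length 3 * m

# Hypothetical usage with one LinearSVC per action category, trained beforehand:
#   svm = trained_completeness_svms[target_category]
#   completeness_score = svm.decision_function([features])[0]
```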
FIG. 9 shows a block diagram of an apparatus for detecting motion in a video according to an embodiment of the present application. As shown in fig. 9, the apparatus 9000 comprises an obtaining module 9100, a generating module 9200 and an aggregating module 9300.
The obtaining module 9100 is configured to obtain motion estimation values corresponding to a plurality of segments in a video.
The generating module 9200 is configured to generate a time sequence motion degree sequence of the video according to the motion degree estimates respectively corresponding to the multiple segments.
The aggregation module 9300 is configured to aggregate the motion prediction time domain interval in the video based on the time sequence motion degree sequence.
FIG. 10 shows a block diagram of an acquisition module according to one embodiment of the present application. As shown in fig. 10, the acquisition module 9100 includes a first convolutional neural network 9110, a second convolutional neural network 9120, and a first acquisition submodule 9130.
The first convolutional neural network 9110 is configured to calculate a first motion estimation value corresponding to each of a plurality of segments in the video, where the first motion estimation value is obtained from any one of the frames of images in the segments.
The second convolutional neural network 9120 is configured to calculate a second motion degree estimation value corresponding to each of the segments, where the second motion degree estimation value is obtained from optical flow field images of the segments, and the optical flow field images are formed by merging optical flow fields extracted from any multiple frames of images in the segments.
The first obtaining sub-module 9130 is configured to obtain, according to the first action degree estimate and the second action degree estimate respectively corresponding to each segment, an action degree estimate respectively corresponding to each segment.
According to an embodiment, the first obtaining sub-module 9130 may add the first motion estimation value and the second motion estimation value corresponding to each segment, respectively, to obtain the motion estimation value corresponding to each segment.
According to another embodiment of the present application, the generating module 9200 may perform normalization processing on the motion degree estimation values corresponding to the plurality of segments, and generate a time sequence motion degree sequence of the video according to each motion degree estimation value after the normalization processing.
FIG. 11 shows a block diagram of an aggregation module, according to an embodiment of the present application. As shown in fig. 11, the aggregation module 9300 includes a second obtaining sub-module 9310 and an aggregation sub-module 9320.
The second obtaining sub-module 9310 is configured to obtain each foreground segment in the time-series motion degree sequence, where a motion degree estimate corresponding to the foreground segment is greater than or equal to a motion degree threshold.
The aggregating submodule 9320 is configured to aggregate the plurality of adjacent foreground segments into the action prediction time domain interval based on a tolerance threshold. Wherein a tolerance threshold is used to characterize the proportion of the foreground segment in the motion prediction temporal interval.
Fig. 12 shows a block diagram of an apparatus for detecting motion in a video according to another embodiment of the present application. As shown in fig. 12, the apparatus 9000' includes an action category determining module 9400 in addition to the acquiring module 9100, the generating module 9200 and the aggregating module 9300.
The action type determining module 9400 is configured to determine an action type corresponding to the action prediction time domain interval according to a segment included in the action prediction time domain interval.
FIG. 13 illustrates a block diagram of an action category determination module according to one embodiment of the present application. As shown in fig. 13, the action category determination module 9400 includes a score calculation sub-module 9410 and a first determination sub-module 9420.
The score calculating sub-module 9410 is configured to obtain scores of the segments in the motion prediction time domain interval based on each motion category, and combine the scores of the segments to obtain a total score of the current motion category.
The first determining sub-module 9420 is configured to determine a target motion category of the motion prediction time domain interval according to the total score of each motion category.
According to one embodiment, the first determination sub-module 9420 determines the action category with the highest overall score as the target action category for the action prediction temporal interval.
Fig. 14 shows a block diagram of an apparatus for detecting motion in a video according to another embodiment of the present application. As shown in fig. 14, the apparatus 9000 ″ further includes an action integrity determination module 9500 in addition to the acquisition module 9100, the generation module 9200, the aggregation module 9300, and the action category determination module 9400.
The motion integrity determination module 9500 is configured to determine the motion integrity of the motion prediction time domain interval according to each segment included in the motion prediction time domain interval and a segment adjacent to the motion prediction time domain interval.
FIG. 15 shows a block diagram of an action integrity determination module according to one embodiment of the present application. As shown in fig. 15, the motion integrity determination module 9500 includes a third acquisition sub-module 9510 and a second determination sub-module 9520.
The third obtaining sub-module 9510 is configured to obtain a score of each adjacent segment of the motion prediction time domain interval based on the target motion category.
The second determining sub-module 9520 is configured to determine the motion integrity of the motion prediction time domain interval according to each segment in the motion prediction time domain interval and the score of each adjacent segment.
According to one embodiment, the motion prediction time domain interval may include a motion prediction start time and a motion prediction end time.
As will be appreciated by one skilled in the art, aspects of the present application may be embodied as a system, method or computer program product. Accordingly, this application may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to as a "circuit," "module," or "system." Furthermore, the present application may take the form of a computer program product embodied in any tangible expression medium having computer-usable program code embodied in the medium.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Although the above description includes many specific arrangements and parameters, it should be noted that these specific arrangements and parameters are merely illustrative of one embodiment of the present application. This should not be taken as limiting the scope of the application. Those skilled in the art will appreciate that various modifications, additions and substitutions are possible, without departing from the scope and spirit of the application. Accordingly, the scope of the application should be construed based on the claims.

Claims (25)

1. A method of detecting motion in a video, comprising:
obtaining action degree estimated values respectively corresponding to a plurality of segments in a video;
generating a time sequence action degree sequence of the video according to the action degree estimated values respectively corresponding to the plurality of segments; and
and acquiring foreground fragments containing action instances in the time sequence action degree sequence, and aggregating a plurality of foreground fragments to obtain action prediction time domain intervals corresponding to one or more fragments in the video.
2. The method of claim 1, wherein said obtaining the motion estimation values corresponding to the plurality of segments in the video comprises:
calculating a first action degree estimation value corresponding to each of a plurality of segments in the video by using a first convolutional neural network, wherein the first action degree estimation value is obtained from any one frame of image in the segments;
calculating a second action degree estimation value corresponding to each segment by using a second convolutional neural network, wherein the second action degree estimation value is obtained from optical flow field images of the segments, and the optical flow field images are formed by combining optical flow fields extracted from any multi-frame images in the segments; and
and obtaining the action degree estimated value corresponding to each segment according to the first action degree estimated value and the second action degree estimated value corresponding to each segment.
3. The method of claim 2, wherein the obtaining the motion estimation value corresponding to each segment according to the first motion estimation value and the second motion estimation value corresponding to each segment comprises:
and adding the first action degree estimation value and the second action degree estimation value corresponding to each segment to obtain the action degree estimation value corresponding to each segment.
4. The method of claim 1, wherein said generating a time-sequential motion estimation sequence of the video based on the motion estimation values respectively corresponding to the plurality of segments comprises:
and normalizing the action degree estimated values respectively corresponding to the plurality of segments, and generating a time sequence action degree sequence of the video according to each action degree estimated value after the normalization processing.
5. The method of claim 1, wherein the obtaining foreground segments of the time-series action degree sequence, which include action instances, and the aggregating a plurality of the foreground segments to obtain action prediction time domain intervals corresponding to one or more segments in the video comprises:
obtaining each foreground segment containing an action instance in the time sequence action degree sequence, wherein an action degree estimated value corresponding to the foreground segment is greater than or equal to an action degree threshold value; aggregating a plurality of the foreground segments that are adjacent to each other into the motion prediction temporal interval based on a tolerance threshold.
6. The method of claim 5, wherein the tolerance threshold is used to characterize a proportion of the foreground segments in the motion prediction temporal interval.
7. The method of claim 1, further comprising:
and determining the action type corresponding to the action prediction time domain interval according to the segments contained in the action prediction time domain interval.
8. The method of claim 7, wherein the determining, according to the segments included in the motion prediction time domain interval, the motion category corresponding to the motion prediction time domain interval comprises:
respectively obtaining the scores of all the segments in the action prediction time domain interval based on each action category, and combining the scores of all the segments to obtain the total score of the current action category; and
and determining the target action category of the action prediction time domain interval according to the total score of each action category.
9. The method of claim 8, wherein the determining a target action category for the action prediction temporal interval from the total score for each action category comprises:
and determining the action category with the highest total score as a target action category of the action prediction time domain interval.
10. The method of any one of claims 8-9, further comprising:
and determining the action integrity of the action prediction time domain interval according to each segment contained in the action prediction time domain interval and the adjacent segment of the action prediction time domain interval.
11. The method of claim 10, wherein the determining the motion integrity of the motion prediction temporal interval according to each segment included in the motion prediction temporal interval and the adjacent segments of the motion prediction temporal interval comprises:
obtaining scores of adjacent segments of the action prediction time domain interval based on the target action category; and
and determining the action integrity of the action prediction time domain interval according to each segment in the action prediction time domain interval and the score of each adjacent segment.
12. The method of claim 1, wherein the motion prediction temporal interval comprises a motion prediction start time and a motion prediction end time.
13. An apparatus to detect motion in video, comprising:
the acquisition module is used for acquiring action degree estimated values respectively corresponding to a plurality of segments in the video;
the generating module is used for generating a time sequence action degree sequence of the video according to the action degree estimated values respectively corresponding to the plurality of segments; and
and the aggregation module is used for acquiring foreground segments containing action instances in the time sequence action degree sequence, and aggregating the plurality of foreground segments to obtain action prediction time domain intervals corresponding to one or more segments in the video.
14. The apparatus of claim 13, wherein the acquisition module comprises:
a first convolutional neural network configured to calculate a first action degree estimated value corresponding to each of the plurality of segments in the video, the first action degree estimated value being obtained from any one frame of image in the segment;
a second convolutional neural network configured to calculate a second action degree estimated value corresponding to each segment, the second action degree estimated value being obtained from an optical flow field image of the segment, the optical flow field image being formed by combining optical flow fields extracted from any multiple frames of images in the segment; and
a first obtaining sub-module configured to obtain the action degree estimated value corresponding to each segment according to the first action degree estimated value and the second action degree estimated value corresponding to the segment.
15. The apparatus of claim 14, wherein the first obtaining sub-module adds the first action degree estimated value and the second action degree estimated value corresponding to each segment to obtain the action degree estimated value corresponding to the segment.
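Claims 14 and 15 describe a two-stream arrangement: one convolutional network scores a single frame of the segment, another scores an optical-flow image built from several frames, and the two estimates are added. A minimal PyTorch-style sketch follows, assuming the two backbone networks are supplied from elsewhere; the class and argument names are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn

class ActionnessEstimator(nn.Module):
    """Two-stream action degree estimation for one video segment (claims 14-15)."""

    def __init__(self, rgb_net: nn.Module, flow_net: nn.Module):
        super().__init__()
        self.rgb_net = rgb_net    # scores a single RGB frame of the segment
        self.flow_net = flow_net  # scores a stacked optical-flow image of the segment

    def forward(self, frame: torch.Tensor, flow_stack: torch.Tensor) -> torch.Tensor:
        first_estimate = self.rgb_net(frame)         # from any one frame of the segment
        second_estimate = self.flow_net(flow_stack)  # from the combined optical flow fields
        # Claim 15: the segment's action degree estimate is the sum of the two.
        return first_estimate + second_estimate
```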
16. The apparatus of claim 13, wherein the generating module normalizes the action degree estimated values corresponding to the segments and generates the time sequence action degree sequence of the video according to the normalized action degree estimated values.
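Claim 16 only requires that the per-segment estimates be normalized before forming the time sequence action degree sequence. The helper below uses min-max scaling purely as an example of such a normalization; the function name is illustrative.

```python
import numpy as np

def actionness_sequence(raw_estimates: list[float]) -> np.ndarray:
    """Normalize per-segment action degree estimates into a temporal sequence (claim 16)."""
    values = np.asarray(raw_estimates, dtype=np.float64)
    span = values.max() - values.min()
    # Min-max scaling to [0, 1]; any other normalization would satisfy the claim equally.
    return (values - values.min()) / span if span > 0 else np.zeros_like(values)
```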
17. The apparatus of claim 13, wherein the aggregation module comprises:
a second obtaining sub-module configured to obtain each foreground segment containing an action instance in the time sequence action degree sequence, the action degree estimated value corresponding to a foreground segment being greater than or equal to an action degree threshold; and
an aggregation sub-module configured to aggregate a plurality of adjacent foreground segments into the action prediction time domain interval based on a tolerance threshold.
18. The apparatus of claim 17, wherein the tolerance threshold characterizes the proportion of foreground segments in the action prediction time domain interval.
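Claims 17 and 18 first threshold the sequence to mark foreground segments and then merge neighbouring foreground segments into an interval, with the tolerance threshold bounding the fraction of foreground segments the interval must retain. The sketch below is one possible greedy realisation under those constraints; the left-to-right expansion order and the trailing-background trimming are assumptions not spelled out in the claims.

```python
import numpy as np

def aggregate_intervals(actionness: np.ndarray,
                        actionness_threshold: float,
                        tolerance: float) -> list[tuple[int, int]]:
    """Group foreground segments into action prediction time domain intervals.

    A segment is foreground when its action degree estimate is >= the
    actionness threshold (claim 17). An interval keeps growing to the right
    as long as the fraction of foreground segments inside it stays >= the
    tolerance threshold (claim 18).
    """
    foreground = actionness >= actionness_threshold
    intervals: list[tuple[int, int]] = []
    start = None

    def close(a: int, b: int) -> None:
        # Trim trailing background segments before recording the interval.
        while b > a and not foreground[b]:
            b -= 1
        intervals.append((a, b))

    for i, fg in enumerate(foreground):
        if start is None:
            if fg:
                start = i
        elif foreground[start:i + 1].mean() < tolerance:
            close(start, i - 1)
            start = i if fg else None
    if start is not None:
        close(start, len(foreground) - 1)
    return intervals
```

With `actionness = np.array([0.9, 0.8, 0.1, 0.9, 0.9])`, a threshold of 0.5 and a tolerance of 0.7 yields `[(0, 1), (3, 4)]`; loosening the tolerance to 0.6 lets the gap be absorbed and returns the single interval `[(0, 4)]`.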
19. The apparatus of claim 13, further comprising:
an action category determining module configured to determine the action category corresponding to the action prediction time domain interval according to the segments contained in the action prediction time domain interval.
20. The apparatus of claim 19, wherein the action category determining module comprises:
a score calculation sub-module configured to, for each action category, obtain a score of every segment in the action prediction time domain interval and combine the scores of the segments to obtain a total score of that action category; and
a first determining sub-module configured to determine the target action category of the action prediction time domain interval according to the total score of each action category.
21. The apparatus of claim 20, wherein the first determining sub-module determines the action category with the highest total score as the target action category of the action prediction time domain interval.
22. The apparatus of any one of claims 20-21, further comprising:
an action integrity determining module configured to determine the action integrity of the action prediction time domain interval according to each segment contained in the action prediction time domain interval and the adjacent segments of the action prediction time domain interval.
23. The apparatus of claim 22, wherein the action integrity determining module comprises:
a third obtaining sub-module configured to obtain, based on the target action category, a score of each adjacent segment of the action prediction time domain interval; and
a second determining sub-module configured to determine the action integrity of the action prediction time domain interval according to each segment in the action prediction time domain interval and the score of each adjacent segment.
24. The apparatus of claim 13, wherein the action prediction time domain interval comprises an action prediction start time and an action prediction end time.
25. A system for detecting motion in a video, comprising:
a memory configured to store executable instructions; and
a processor in communication with the memory and configured to execute the executable instructions so as to:
obtain action degree estimated values respectively corresponding to a plurality of segments in a video;
generate a time sequence action degree sequence of the video according to the action degree estimated values respectively corresponding to the plurality of segments; and
acquire foreground segments containing action instances in the time sequence action degree sequence, and aggregate the plurality of foreground segments to obtain action prediction time domain intervals corresponding to one or more segments in the video.
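Claim 25 restates the method as a memory-plus-processor system. Purely for orientation, the fragment below chains the illustrative helpers sketched under the earlier claims into one pass over a video; the function names are assumptions carried over from those sketches, not the patent's own API.

```python
def detect_actions(video_segments, estimator,
                   actionness_threshold: float = 0.5,
                   tolerance: float = 0.7) -> list[tuple[int, int]]:
    """End-to-end sketch of the claimed pipeline.

    video_segments: iterable of (frame, flow_stack) tensor pairs, one per segment.
    estimator:      something like the two-stream ActionnessEstimator above.
    Returns action prediction time domain intervals as (start, end) segment indices.
    actionness_sequence and aggregate_intervals are the illustrative helpers
    defined in the earlier sketches.
    """
    # 1. Per-segment action degree estimates.
    raw = [float(estimator(frame, flow).squeeze()) for frame, flow in video_segments]
    # 2. Normalized time sequence action degree sequence.
    sequence = actionness_sequence(raw)
    # 3. Foreground extraction and aggregation into predicted intervals.
    return aggregate_intervals(sequence, actionness_threshold, tolerance)
```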
CN201710131577.XA 2017-03-07 2017-03-07 Method, device and system for detecting motion in video Active CN108234821B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710131577.XA CN108234821B (en) 2017-03-07 2017-03-07 Method, device and system for detecting motion in video

Publications (2)

Publication Number Publication Date
CN108234821A (en) 2018-06-29
CN108234821B (en) 2020-11-06

Family

ID=62657266

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710131577.XA Active CN108234821B (en) 2017-03-07 2017-03-07 Method, device and system for detecting motion in video

Country Status (1)

Country Link
CN (1) CN108234821B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108881952B (en) * 2018-07-02 2021-09-14 上海商汤智能科技有限公司 Video generation method and device, electronic equipment and storage medium
CN109840917B (en) * 2019-01-29 2021-01-26 北京市商汤科技开发有限公司 Image processing method and device and network training method and device
CN109948446B (en) * 2019-02-20 2021-07-16 北京奇艺世纪科技有限公司 Video clip processing method and device and computer readable storage medium
CN110263733B (en) * 2019-06-24 2021-07-23 上海商汤智能科技有限公司 Image processing method, nomination evaluation method and related device
CN111368786A (en) * 2020-03-16 2020-07-03 平安科技(深圳)有限公司 Action region extraction method, device, equipment and computer readable storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6874679B2 (en) * 2015-05-27 2021-05-19 コニカミノルタ株式会社 Monitoring device
AU2015207945A1 (en) * 2015-07-31 2017-02-16 Canon Kabushiki Kaisha Method for training an artificial neural network
CN106326870A (en) * 2016-08-27 2017-01-11 上海盟云移软网络科技股份有限公司 Hand action capturing method for virtual reality system
CN106331636A (en) * 2016-08-31 2017-01-11 东北大学 Intelligent video monitoring system and method of oil pipelines based on behavioral event triggering

Also Published As

Publication number Publication date
CN108234821A (en) 2018-06-29

Similar Documents

Publication Publication Date Title
CN108234821B (en) Method, device and system for detecting motion in video
US11093789B2 (en) Method and apparatus for object re-identification
CN109977262B (en) Method and device for acquiring candidate segments from video and processing equipment
US10102421B2 (en) Method and device for face recognition in video
US20170161591A1 (en) System and method for deep-learning based object tracking
CN110245679B (en) Image clustering method and device, electronic equipment and computer readable storage medium
US20190138798A1 (en) Time domain action detecting methods and system, electronic devices, and computer storage medium
US9514363B2 (en) Eye gaze driven spatio-temporal action localization
WO2019023921A1 (en) Gesture recognition method, apparatus, and device
US20130070105A1 (en) Tracking device, tracking method, and computer program product
CN114787865A (en) Light tracking: system and method for online top-down human pose tracking
Mahmud et al. A poisson process model for activity forecasting
US9619753B2 (en) Data analysis system and method
US11017296B2 (en) Classifying time series image data
US10846515B2 (en) Efficient face detection and tracking
KR102476022B1 (en) Face detection method and apparatus thereof
US11126822B2 (en) Method and apparatus for obtaining painting
RU2013146529A (en) RECOGNITION OF DYNAMIC HAND GESTURE WITH SELECTIVE INITIATION ON THE BASIS OF DETECTED HAND SPEED
CN112381071A (en) Behavior analysis method of target in video stream, terminal device and medium
CN111241961B (en) Face detection method and device and electronic equipment
US8428369B2 (en) Information processing apparatus, information processing method, and program
EP2998928B1 (en) Apparatus and method for extracting high watermark image from continuously photographed images
JP2020017136A (en) Object detection and recognition apparatus, method, and program
CN113706580B (en) Target tracking method, system, equipment and medium based on relevant filtering tracker
CN112613510B (en) Picture preprocessing method, character recognition model training method and character recognition method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant