CN111368786A - Action region extraction method, device, equipment and computer readable storage medium - Google Patents


Info

Publication number
CN111368786A
Authority
CN
China
Prior art keywords
action
sequence
video
time period
period information
Prior art date
Legal status
Pending
Application number
CN202010185060.0A
Other languages
Chinese (zh)
Inventor
张国辉 (Zhang Guohui)
朱文和 (Zhu Wenhe)
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202010185060.0A priority Critical patent/CN111368786A/en
Publication of CN111368786A publication Critical patent/CN111368786A/en
Priority to PCT/CN2020/136320 priority patent/WO2021184852A1/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20: Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23: Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234: Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs

Abstract

The invention provides an action area extraction method, which relates to the technical field of image processing and comprises the following steps: acquiring a video to be trimmed, and performing feature extraction on the video to be trimmed to obtain a first feature sequence; inputting the first feature sequence into a preset time sequence evaluation model to obtain time sequence information; detecting whether the time sequence information meets a preset condition, and obtaining time period information of an action area according to the detection result; and extracting the corresponding action area from the video to be trimmed according to the time period information. The invention also provides an action area extraction device, equipment and a computer readable storage medium. The method can solve the problem of poor accuracy in existing action area extraction methods.

Description

Action region extraction method, device, equipment and computer readable storage medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a method, an apparatus, a device, and a computer-readable storage medium for extracting an action region.
Background
Video content analysis is a popular research topic in the field of AI (Artificial Intelligence), and action recognition is one of the important branches of video analysis. It has broad application prospects in fields such as intelligent video surveillance, human-computer interaction, motion analysis, and video retrieval, and has attracted wide attention from researchers at home and abroad.
In the course of action recognition, the video must first be clipped to obtain a plurality of video clips, each containing only one action instance. However, video recorded in a real-world scene is usually long and contains much content that is irrelevant to any action instance. In this case, the action instances in the uncropped video are usually detected by means of timing detection. Specifically, the timing detection task can be divided into two stages: action region extraction and action region classification. The extraction stage aims to extract the video regions containing action instances, and the classification stage classifies the extracted action regions. Therefore, obtaining high-quality action regions is key to ensuring the accuracy of the action instance detection result.
Currently, sliding time windows of multiple durations are usually slid at fixed intervals to extract action regions, but action regions extracted with predefined durations and intervals have the following drawbacks: 1) the boundaries are often temporally inaccurate; 2) the durations of action instances in real scenes are complex and variable and cannot be covered flexibly, especially over a large time range. Therefore, the existing action region extraction method suffers from poor accuracy.
Disclosure of Invention
The invention mainly aims to provide an action area extraction method, an action area extraction device, action area extraction equipment and a computer readable storage medium, and aims to solve the problem that the existing action area extraction method is poor in accuracy.
In order to achieve the above object, the present invention provides an action region extraction method, including:
acquiring a video to be trimmed, and performing feature extraction on the video to be trimmed to obtain a first feature sequence;
inputting the first feature sequence into a preset time sequence evaluation model to obtain time sequence information;
detecting whether the time sequence information meets a preset condition or not, and obtaining time period information of an action area according to a detection result;
and extracting a corresponding action area from the video to be trimmed according to the time period information.
Optionally, the step of obtaining a video to be trimmed and performing feature extraction on the video to be trimmed to obtain a first feature sequence includes:
acquiring a video to be trimmed, and performing framing processing on the video to be trimmed to obtain a video image sequence;
acquiring a target video image every other preset frame number in the video image sequence, and extracting red, green and blue (RGB) features and optical flow features of the target video image to obtain an RGB feature sequence and an optical flow feature sequence;
and splicing the RGB feature sequence and the optical flow feature sequence to obtain a first feature sequence.
Optionally, the time sequence information includes an action segment start probability and an action segment end probability corresponding to each target video image, and the step of detecting whether the time sequence information meets a preset condition and obtaining time period information of an action area according to a detection result includes:
detecting whether a probability value larger than a first preset threshold value exists in the action segment starting probability to obtain a first detection result, and obtaining an action segment starting time array according to the first detection result;
detecting whether a probability value larger than a second preset threshold value exists in the action segment ending probabilities to obtain a second detection result, and obtaining an action segment ending time array according to the second detection result;
and combining to obtain the time period information of the action area according to the action segment starting time array and the action segment ending time array.
Optionally, before the step of extracting the corresponding action region from the video to be trimmed according to the time period information, the method further includes:
sampling the features of each action area based on the first feature sequence and the time period information to obtain a second feature sequence;
inputting the second feature sequence into a preset action region evaluation model to obtain an action region evaluation score;
the step of extracting the corresponding action area from the video to be trimmed according to the time period information comprises the following steps:
and extracting a corresponding action area from the video to be trimmed according to the action area evaluation score and the time period information.
Optionally, the step of sampling the features of each action area based on the first feature sequence and the time period information to obtain a second feature sequence includes:
sampling the features of each action area by adopting a linear interpolation method based on the first feature sequence and the time period information to obtain a first preset number of feature data;
and splicing the first preset number of feature data to obtain a second feature sequence.
Optionally, the step of extracting a corresponding action region from the video to be trimmed according to the action region evaluation score and the time period information includes:
sorting the action region evaluation scores in descending order;
obtaining a second preset number of action area evaluation scores according to the sorting result, and taking the second preset number of action area evaluation scores as target action area evaluation scores;
acquiring target time period information corresponding to the target action area evaluation score from the time period information;
and extracting a corresponding action area from the video to be trimmed according to the target time period information.
In order to achieve the above object, the present invention also provides an action region extraction device including:
the feature extraction module is used for acquiring a video to be trimmed and performing feature extraction on the video to be trimmed to obtain a first feature sequence;
the information acquisition module is used for inputting the first feature sequence into a preset time sequence evaluation model to obtain time sequence information;
the information detection module is used for detecting whether the time sequence information meets a preset condition or not and obtaining time period information of the action area according to a detection result;
and the region extraction module is used for extracting a corresponding action region from the video to be trimmed according to the time period information.
Optionally, the action region extraction device further includes:
the feature sampling module is used for sampling the features of each action area based on the first feature sequence and the time period information to obtain a second feature sequence;
the score evaluation module is used for inputting the second feature sequence into a preset action region evaluation model to obtain an action region evaluation score;
the region extraction module is specifically configured to extract a corresponding action region from the video to be trimmed according to the action region evaluation score and the time period information.
Further, to achieve the above object, the present invention also provides an action region extraction device comprising a memory, a processor, and an action region extraction program stored on the memory and executable by the processor, wherein the action region extraction program, when executed by the processor, implements the steps of the action region extraction method as described above.
Further, to achieve the above object, the present invention also provides a computer-readable storage medium having stored thereon an action region extraction program, wherein the action region extraction program, when executed by a processor, implements the steps of the action region extraction method as described above.
The invention provides a method, a device and equipment for extracting an action area and a computer readable storage medium. A video to be trimmed is acquired and feature extraction is performed on it to obtain a first feature sequence; the first feature sequence is then input into a preset time sequence evaluation model to obtain time sequence information; whether the time sequence information meets a preset condition is detected, and time period information of an action area is obtained according to the detection result; a corresponding action area is then extracted from the video to be trimmed according to the time period information. In the embodiment of the invention, features are extracted and the corresponding time sequence information is obtained, i.e., the probabilities that the time positions corresponding to the features belong to the start, middle, and end of an action segment; boundary positions with high probability, i.e., the start time and end time of the action region, are thereby screened out, and an accurate action region is extracted.
Drawings
FIG. 1 is a schematic diagram of an apparatus architecture of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a first embodiment of a method for extracting an action region according to the present invention;
FIG. 3 is a detailed flowchart of step S10 in the first embodiment of the present invention;
FIG. 4 is a detailed flowchart of step S30 in the first embodiment of the present invention;
FIG. 5 is a flowchart illustrating a second embodiment of the method for extracting an action region according to the present invention;
fig. 6 is a functional block diagram of the action region extraction device according to the first embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, fig. 1 is a schematic device structure diagram of a hardware operating environment according to an embodiment of the present invention.
The action region extraction device according to the embodiment of the present invention may be a terminal device such as a PC (Personal Computer) or a notebook computer, or a server.
As shown in fig. 1, the action region extraction device may include: a processor 1001, such as a CPU (Central Processing Unit), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used for realizing connection communication among these components; the user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi (Wireless-Fidelity) interface); the memory 1005 may be a Random Access Memory (RAM) or a non-volatile memory, such as a disk memory, and the memory 1005 may optionally be a storage device independent of the processor 1001. Those skilled in the art will appreciate that the configuration shown in fig. 1 does not constitute a limitation of the action region extraction device, which may include more or fewer components than those shown, or combine some components, or use a different arrangement of components.
With continued reference to fig. 1, a memory 1005, which is one type of computer storage medium in fig. 1, may include therein an operating system, a network communication module, and an action region extraction program. In fig. 1, the network communication module may be used to connect to a server and perform data communication with the server; and the processor 1001 may be configured to call the action area extraction program stored in the memory 1005 and execute the action area extraction method provided by the embodiment of the present invention.
Based on the above hardware structure, embodiments of the action region extraction method of the present invention are provided.
The invention provides an action region extraction method.
Referring to fig. 2, fig. 2 is a flowchart illustrating a first embodiment of the method for extracting an action region according to the present invention.
In this embodiment, the action region extraction method includes:
step S10, acquiring a video to be trimmed, and performing feature extraction on the video to be trimmed to obtain a first feature sequence;
in the present embodiment, the motion region extracting method is implemented by a motion region extracting device, which may be a PC, a notebook computer, a server, or the like, and the motion region extracting device is described by taking a server as an example. The action region extraction method provided by the embodiment of the invention can be applied to scenes such as security protection, monitoring, wonderful action picture editing and the like, and is used for extracting and obtaining the region containing the action from the video.
In this embodiment, a video to be trimmed is obtained first, and then feature extraction is performed on the video to be trimmed to obtain a first feature sequence. Specifically, the video to be trimmed is first subjected to framing processing to obtain a video image sequence; then, a target video image is acquired every preset number of frames in the video image sequence, and RGB (Red-Green-Blue) features and optical flow features of the target video images are extracted to obtain an RGB feature sequence and an optical flow feature sequence; the RGB feature sequence and the optical flow feature sequence are further spliced to obtain the first feature sequence. For the specific process of acquiring the first feature sequence, reference may be made to the following embodiments, which are not described herein again.
Step S20, inputting the first feature sequence into a preset time sequence evaluation model to obtain time sequence information;
After the first feature sequence is obtained, it is input into a preset time sequence evaluation model to obtain time sequence information. The time sequence information comprises an action segment start probability, an action segment middle probability, and an action segment end probability corresponding to each target video image: the action segment start probability of a target video image is the probability that it belongs to the start part of an action segment; the action segment middle probability is the probability that it belongs to the middle part of an action segment; and the action segment end probability is the probability that it belongs to the end part of an action segment. For convenience of explanation, the action segment start, middle, and end probabilities are denoted Ps, Pm, and Pe, respectively. The preset time sequence evaluation model is obtained by training a pre-constructed convolutional neural network model on training samples. It is composed of 3 convolutional layers: the first 2 layers are identical, each with 512 filters, a convolution kernel size of 3, a ReLU activation function, and a stride of 1; the last layer has 3 filters, a convolution kernel size of 1, and a Sigmoid activation function, as follows:
Conv(512,3,Relu)→Conv(512,3,Relu)→Conv(3,1,Sigmoid)
wherein the 3 filters of the last convolutional layer output the start, middle, and end probability values, respectively.
The loss function consists of 3 losses for the start (s), middle (m), and end (e) probabilities. It can be expressed as:
J = λ·L(s) + L(m) + L(e)
L is a binary logistic regression loss function, which can be expressed as:
L = Σi [bi·log(Pi) + (1 − bi)·log(1 − Pi)]
where bi is an indicator function that is 1 when true and 0 when false.
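For illustration only, the following is a minimal PyTorch sketch of such a time sequence evaluation network and its loss. The 400-dimensional input size, the module and function names, and the use of the negative of the stated log-likelihood as the training objective are assumptions, not the patent's reference implementation:

    import torch
    import torch.nn as nn

    class TimeSequenceEvaluationModel(nn.Module):
        """Conv(512,3,ReLU) -> Conv(512,3,ReLU) -> Conv(3,1,Sigmoid)."""
        def __init__(self, feat_dim=400):  # assumed per-image feature size
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(feat_dim, 512, kernel_size=3, stride=1, padding=1), nn.ReLU(),
                nn.Conv1d(512, 512, kernel_size=3, stride=1, padding=1), nn.ReLU(),
                nn.Conv1d(512, 3, kernel_size=1), nn.Sigmoid(),
            )

        def forward(self, x):       # x: (batch, feat_dim, L) first feature sequence
            return self.net(x)      # (batch, 3, L): Ps, Pm, Pe per target image

    def binary_logistic_loss(p, b):
        # negative of L = sum(b*log(P) + (1-b)*log(1-P)), averaged for stability
        eps = 1e-6
        return -(b * torch.log(p + eps) + (1 - b) * torch.log(1 - p + eps)).mean()

    def total_loss(probs, labels, lam=1.0):
        # J = lambda*L(s) + L(m) + L(e), matching the weighting given above
        s, m, e = (binary_logistic_loss(probs[:, i], labels[:, i]) for i in range(3))
        return lam * s + m + e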
Step S30, detecting whether the time sequence information meets the preset condition, and obtaining the time period information of the action area according to the detection result;
and then, detecting whether the time sequence information meets a preset condition or not, and obtaining time period information of the action area according to a detection result. The time period information is video time period information belonging to the action region, and may include multiple groups of time periods, each time period being composed of a start time and an end time of an action segment.
Specifically, as one of the detection modes, it may be detected whether a probability value greater than a first preset threshold exists in the action segment start probabilities to obtain a first detection result, and obtain an action segment start time array according to the first detection result; meanwhile, whether the probability value larger than a second preset threshold value exists in the action segment ending probability is detected to obtain a second detection result, and an action segment ending time array is obtained according to the second detection result; and then, combining the action segment start time array and the action segment end time array to obtain the time period information of the action area.
As another detection mode, it may be detected whether peaks exist in the action segment start probabilities, i.e., start probabilities greater than those at both the previous moment and the next moment, to obtain a first detection result, and an action segment start time array is obtained according to the first detection result; meanwhile, it may be detected whether peaks exist in the action segment end probabilities, i.e., end probabilities greater than those at both the previous moment and the next moment, to obtain a second detection result, and an action segment end time array is obtained according to the second detection result; then, the time period information of the action area is obtained by combination according to the action segment start time array and the action segment end time array.
Of course, in a specific implementation, the detection conditions of the two modes may be combined: the time of the corresponding target video image is stored into the action segment start/end time array either when any one of the 2 detection conditions is met or only when both conditions are met simultaneously, so as to obtain the time period information, as sketched below. For the specific detection process, reference may be made to the following embodiments, which are not described in detail here.
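As an illustration of how the two detection modes might be combined under an 'any one condition is met' rule, the following is a minimal NumPy sketch; the threshold values and names are assumptions:

    import numpy as np

    def candidate_times(prob, threshold):
        # keep times whose probability exceeds the threshold or is a local peak,
        # i.e. greater than the probabilities at the previous and next moments
        above = prob > threshold
        peak = np.zeros_like(above)
        peak[1:-1] = (prob[1:-1] > prob[:-2]) & (prob[1:-1] > prob[2:])
        return np.flatnonzero(above | peak)

    # Ps, Pe: per-target-image start/end probabilities from the evaluation model
    Ps = np.array([0.1, 0.8, 0.3, 0.2, 0.1, 0.2])
    Pe = np.array([0.1, 0.2, 0.2, 0.3, 0.9, 0.2])
    Ts = candidate_times(Ps, 0.7)   # action segment start time array {Ts}
    Te = candidate_times(Pe, 0.7)   # action segment end time array {Te}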
Step S40, extracting a corresponding action area from the video to be trimmed according to the time period information.
After the time period information is acquired, the corresponding action area is extracted from the video to be trimmed according to the time period information.
The embodiment of the invention provides an action area extraction method: a video to be trimmed is acquired and feature extraction is performed on it to obtain a first feature sequence; the first feature sequence is then input into a preset time sequence evaluation model to obtain time sequence information; whether the time sequence information meets a preset condition is detected, and time period information of an action area is obtained according to the detection result; a corresponding action area is then extracted from the video to be trimmed according to the time period information. In the embodiment of the invention, features are extracted and the corresponding time sequence information is obtained, i.e., the probabilities that the time positions corresponding to the features belong to the start, middle, and end of an action segment; boundary positions with high probability, i.e., the start time and end time of the action region, are thereby screened out, and an accurate action region is extracted.
Further, referring to fig. 3, fig. 3 is a detailed flowchart of step S10 in the first embodiment of the present invention;
in the present embodiment, step S10 includes:
step S11, acquiring a video to be trimmed, and performing framing processing on the video to be trimmed to obtain a video image sequence;
in this embodiment, a video to be cropped is obtained first, and then the video to be cropped is subjected to framing processing to obtain a video image sequence. The video image sequence comprises all frames of video images of a video to be trimmed and all frames of video images are arranged according to a time sequence.
Step S12, acquiring a target video image every other preset frame number in the video image sequence, and extracting red, green and blue (RGB) features and optical flow features of the target video image to obtain an RGB feature sequence and an optical flow feature sequence;
the method comprises the steps of acquiring a target video image every other preset frame number in a video image sequence, and then extracting RGB (Red-Green-Blue) features and optical flow features of the target video image to obtain an RGB feature sequence and an optical flow feature sequence. It is understood that, since the target video image includes a plurality of, corresponding, RGB features and optical flow features, a sequence of RGB features may be formed based on the time sequence of the target video image corresponding to each set of RGB features, and similarly, a sequence of optical flow features may be formed based on the time sequence of the target video image corresponding to each set of optical flow features.
When the target video image is obtained, it is assumed that the video to be cropped has N frames, and in order to save the calculation amount, the extraction may be set once every M frames, so that L ═ N/M target video images may be obtained, and correspondingly, L segments may be obtained as the extracted features.
For the extraction of the RGB feature and the optical flow feature, a common TSN (Temporal segmentation networks) algorithm may be adopted to extract the RGB feature and the optical flow feature from the target video image. The TSN is constructed based on a two stream (dual stream) method, and a specific feature extraction method may refer to the prior art and is not described herein again. Furthermore, it should be noted that in particular embodiments, other types of features may also be extracted for extracting the action region.
Step S13, splicing the RGB feature sequence and the optical flow feature sequence to obtain a first feature sequence.
After the RGB feature sequence and the optical flow feature sequence are obtained, they are spliced to obtain the first feature sequence. For example, continuing the above example, after L segments of RGB features and L segments of optical flow features are obtained, each segment of RGB features being a 200 × 100 dimensional matrix and each segment of optical flow features being a 200 × 100 dimensional matrix, the first segment of RGB features and the first segment of optical flow features may be spliced to obtain a 400 × 100 dimensional matrix, i.e., the first feature vector of the first feature sequence F is a 400 × 100 dimensional matrix; similarly, each subsequent segment of RGB features is spliced with the corresponding segment of optical flow features, yielding the first feature sequence.
By the above method, features of the video to be trimmed are extracted to obtain the corresponding first feature sequence, so that the target action region can be extracted subsequently. Meanwhile, in this embodiment the first feature sequence is formed by extracting features of the corresponding video images at intervals of the preset number of frames, which saves computation and speeds up action region extraction, so that the method can be applied to extracting action regions from long videos in scenarios such as security and surveillance, as illustrated by the sketch below.
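A schematic NumPy sketch of steps S11 to S13 follows; the helpers extract_rgb and extract_flow are hypothetical stand-ins for the TSN RGB and optical flow branches, and the 200 × 100 feature shapes are taken from the example above:

    import numpy as np

    def build_first_feature_sequence(frames, M, extract_rgb, extract_flow):
        # sample one target video image every M frames: L = N/M target images
        targets = frames[::M]
        first_feature_sequence = []
        for img in targets:
            rgb = extract_rgb(img)     # e.g. a (200, 100) matrix
            flow = extract_flow(img)   # e.g. a (200, 100) matrix
            # splice along the feature axis: (200,100) + (200,100) -> (400,100)
            first_feature_sequence.append(np.concatenate([rgb, flow], axis=0))
        return first_feature_sequence  # L matrices of shape (400, 100)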
Further, referring to fig. 4, fig. 4 is a detailed flowchart of step S30 in the first embodiment of the present invention.
In this embodiment, the time sequence information includes an action segment start probability and an action segment end probability corresponding to each target video image, and step S30 includes:
Step S31, detecting whether a probability value larger than a first preset threshold value exists in the action segment starting probabilities to obtain a first detection result, and obtaining an action segment starting time array according to the first detection result;
in this embodiment, the time sequence information includes an action segment start probability and an action segment end probability corresponding to each target video image, where the action segment start probability corresponding to each target video image is a probability that each target video image belongs to an action segment start part; and the action segment ending probability corresponding to each target video image is the probability that each target video image belongs to the action segment ending part. In addition, the time sequence information also comprises the action fragment middle probability corresponding to each target video image. For convenience of explanation, the action segment start probability is denoted as Ps, and the action segment end probability is denoted as Pe. It can be understood that, in the above example, since there are L target video images acquired, there are L corresponding Ps and Pe.
Whether a probability value larger than a first preset threshold value exists in the action segment starting probabilities is detected to obtain a first detection result, and an action segment starting time array is obtained according to the first detection result. The first preset threshold is set in advance and is not limited herein. The Ps values larger than the first preset threshold are obtained, the times Ts of the target video images corresponding to those Ps values are determined, and an action segment starting time array {Ts} is formed based on the determined times Ts.
Of course, in specific embodiments, other detection rules may be employed to obtain the action segment starting time array. For example, it may be detected whether Ps has a peak at a certain time, i.e., a Ps value larger than the Ps values at both the previous and the next moment; if this condition is met, the time Ts of the target video image corresponding to that Ps value is stored into the array, yielding the action segment starting time array {Ts}. Of course, the detection may also combine the 2 conditions: the time of the corresponding target video image is stored into the action segment starting time array {Ts} either when any one of the 2 conditions is met or only when both are met simultaneously.
Step S32, detecting whether a probability value larger than a second preset threshold value exists in the action segment ending probabilities to obtain a second detection result, and obtaining an action segment ending time array according to the second detection result;
Whether a probability value larger than a second preset threshold value exists in the action segment ending probabilities is detected to obtain a second detection result, and an action segment ending time array is obtained according to the second detection result. The second preset threshold is set in advance and is not specifically limited herein; it may be the same as or different from the first preset threshold. The Pe values larger than the second preset threshold are obtained, the times Te of the target video images corresponding to those Pe values are determined, and an action segment ending time array {Te} is formed based on the determined times Te.
Similarly, in an embodiment, other detection rules may be employed to obtain the action segment ending time array. For example, it may be detected whether Pe has a peak at a certain time, i.e., a Pe value larger than the Pe values at both the previous and the next moment; if this condition is met, the time Te of the target video image corresponding to that Pe value is stored into the array, yielding the action segment ending time array {Te}. Of course, the detection may also combine the 2 conditions: the time of the corresponding target video image is stored into the action segment ending time array {Te} either when any one of the 2 conditions is met or only when both are met simultaneously.
It should be noted that step S31 and step S32 may be executed in either order.
Step S33, combining the action segment start time array and the action segment end time array to obtain the time period information of the action area.
After the action segment starting time array {Ts} and the action segment ending time array {Te} are obtained, the time period information of the action area is obtained by combining them. Specifically, action region time periods are combined in turn from the selected values of {Ts} and {Te}, and finally the time period information of a plurality of action regions is obtained, which can be represented in the form of an action region time period array. Obviously, each selected Te must be larger than the corresponding Ts, so that a valid action region time period array is formed, as sketched below.
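A minimal sketch of one straightforward reading of this combination step, exhaustively pairing start times with later end times (the function name is an assumption):

    def combine_time_periods(Ts, Te):
        # pair each start time with every later end time; enforce Te > Ts
        return [(ts, te) for ts in Ts for te in Te if te > ts]

    # e.g. Ts = [2, 7] and Te = [5, 9] yield [(2, 5), (2, 9), (7, 9)]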
In this embodiment, by detecting and screening the action segment start probabilities and end probabilities in the time sequence information, the boundary positions with high probability, i.e., the start times and end times of action regions, are obtained, so that accurate action regions can be extracted subsequently.
Further, based on the above embodiments, a second embodiment of the action region extraction method of the present invention is proposed. Referring to fig. 5, fig. 5 is a flowchart illustrating the second embodiment of the action region extraction method according to the present invention.
in this embodiment, before step S40, the motion region extraction method further includes:
step S50, sampling the characteristics of each action area based on the first characteristic sequence and the time period information to obtain a second characteristic sequence;
for the local range, in order to further improve the accuracy and the precision of the extraction result of the action region, the obtained time period information of the action region is evaluated in the global range in the embodiment to obtain a reliable confidence score of the action region for retrieval, so that the action region is obtained, and the accuracy and the precision of the extraction result of the action region can be further improved.
In this embodiment, after the time period information of the action areas is obtained according to the detection result, the features of each action area are sampled based on the first feature sequence and the time period information to obtain a second feature sequence.
Specifically, step S50 includes:
step a1, sampling the features of each action area by adopting a linear interpolation method based on the first feature sequence and the time period information to obtain a first preset number of feature data;
step a2, splicing the first preset number of feature data to obtain a second feature sequence.
Because the time lengths of the action areas corresponding to the pieces of time period information differ, the features of each action area can be sampled by a linear interpolation method based on the first feature sequence and the time period information to obtain a first preset number of feature data. That is, the first preset number of feature data are sampled from the time period of each action region by linear interpolation. Linear interpolation is an interpolation mode in which the interpolation function is a first-order polynomial, and its interpolation error at the interpolation nodes is zero. Of course, in specific implementations, other interpolation modes, such as parabolic interpolation, may also be adopted, but linear interpolation is simpler and more convenient. The first preset number is optionally set to 32, and may be set according to the actual situation, which is not limited herein. The sampling and splicing are illustrated by the sketch below.
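A minimal sketch of this sampling and splicing, assuming each time period is expressed as a pair of fractional indices (ts, te) into the first feature sequence and that each feature entry has been flattened to a vector; the names and the 32-sample default follow the text above:

    import numpy as np

    def sample_region_features(first_feature_sequence, ts, te, num_samples=32):
        # linearly interpolate a fixed number of feature vectors over [ts, te]
        feats = np.asarray(first_feature_sequence)     # shape (L, D)
        positions = np.linspace(ts, te, num_samples)   # fractional indices
        lo = np.floor(positions).astype(int)
        hi = np.minimum(lo + 1, len(feats) - 1)
        w = (positions - lo)[:, None]
        sampled = (1 - w) * feats[lo] + w * feats[hi]  # (num_samples, D)
        return sampled.reshape(-1)                     # spliced second feature vector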
After the first preset number of feature data are obtained, they are spliced to obtain the second feature sequence.
Step S60, inputting the second feature sequence into a preset action region evaluation model to obtain an action region evaluation score;
After the second feature sequence is obtained by sampling, it is input into a preset action region evaluation model to obtain an action region evaluation score. The preset action region evaluation model is obtained by training a pre-constructed candidate region evaluation model on training samples. Its network structure is two fully connected layers: the hidden layer of layer 1 comprises 512 units with a ReLU activation function; layer 2 uses a Sigmoid activation function and outputs the confidence that the action region contains an action segment, as follows:
FC(512,Relu)→FC(1,Sigmoid)
wherein the loss function is a simple regression loss, using the squared difference between the confidence and the intersection-over-union g of the candidate region and the ground-truth segment. It is defined as follows:
L = (1/N) · Σi (pi − gi)²
where N is the number of action regions, pi is the confidence output for the i-th action region, and gi is its intersection-over-union with the ground-truth segment.
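For illustration, a minimal PyTorch sketch of such an evaluation model and its regression loss; the input dimension (32 sampled vectors of size D) and the names are assumptions consistent with the sampling step above:

    import torch
    import torch.nn as nn

    class ActionRegionEvaluationModel(nn.Module):
        """FC(512, ReLU) -> FC(1, Sigmoid): region feature -> confidence."""
        def __init__(self, in_dim):  # e.g. in_dim = 32 * D
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, 512), nn.ReLU(),
                nn.Linear(512, 1), nn.Sigmoid(),
            )

        def forward(self, x):               # x: (num_regions, in_dim)
            return self.net(x).squeeze(-1)  # (num_regions,) confidence scores

    def evaluation_loss(p, g):
        # L = (1/N) * sum((p_i - g_i)^2): squared error against IoU targets
        return ((p - g) ** 2).mean()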
At this time, step S40 includes:
and step S41, extracting a corresponding action area from the video to be modified according to the action area evaluation score and the time period information.
And after the action area evaluation score is obtained, extracting a corresponding action area from the video to be modified according to the action area evaluation score and the time period information.
Specifically, step S41 includes:
step b1, sorting the action region evaluation scores in descending order;
step b2, obtaining a second preset number of action area evaluation scores according to the sorting result, and taking them as target action area evaluation scores;
step b3, obtaining target time period information corresponding to the target action area evaluation scores from the time period information;
step b4, extracting a corresponding action area from the video to be trimmed according to the target time period information.
The extraction process of the action areas is as follows: the action area evaluation scores are first sorted in descending order, a second preset number of action area evaluation scores are obtained from the sorting result as the target action area evaluation scores, and the target time period information corresponding to the target action area evaluation scores is then obtained from the time period information; the corresponding action areas are further extracted from the video to be trimmed according to the target time period information, as sketched below.
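A minimal sketch of steps b1 to b4, assuming the video is addressable as a list of frames and each time period is a pair of frame indices:

    def extract_action_regions(scores, periods, video_frames, k):
        # b1/b2: sort descending by score and keep the k best periods
        ranked = sorted(zip(scores, periods), reverse=True)
        target_periods = [period for _, period in ranked[:k]]
        # b3/b4: cut the corresponding spans out of the video
        return [video_frames[ts:te + 1] for ts, te in target_periods]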
In this embodiment, features are extracted and the corresponding time sequence information is obtained, i.e., the probabilities that the time positions corresponding to the features belong to the start, middle, and end of an action segment; boundary positions with high probability, i.e., the start times and end times of action regions, are screened out to obtain locally accurate action region boundaries. The candidate region-level features are then further evaluated over the global range to obtain reliable action region confidence scores for retrieval, from which the action regions are obtained, further improving the accuracy and precision of the action region extraction result.
The invention also provides an action area extraction device.
Referring to fig. 6, fig. 6 is a functional block diagram of the action region extraction device according to the first embodiment of the present invention.
In this embodiment, the action region extraction device includes:
the feature extraction module 10 is configured to acquire a video to be trimmed, and perform feature extraction on the video to be trimmed to obtain a first feature sequence;
the information acquisition module 20 is configured to input the first feature sequence to a preset time sequence evaluation model to obtain time sequence information;
the information detection module 30 is configured to detect whether the time sequence information meets a preset condition, and obtain time period information of an action area according to a detection result;
and the region extraction module 40 is configured to extract a corresponding action region from the video to be trimmed according to the time period information.
Each virtual function module of the action region extraction device is stored in the memory 1005 of the action region extraction device shown in fig. 1, and is used for realizing all functions of the action region extraction program; when the modules are executed by the processor 1001, the function of improving the accuracy of the action region extraction result can be implemented.
Further, the feature extraction module 10 includes:
the framing processing unit is used for acquiring a video to be trimmed and performing framing processing on the video to be trimmed to obtain a video image sequence;
the feature extraction unit is used for acquiring a target video image every other preset frame number in the video image sequence, extracting red, green and blue (RGB) features and optical flow features of the target video image, and obtaining an RGB feature sequence and an optical flow feature sequence;
and the first splicing unit is used for splicing the RGB characteristic sequence and the optical flow characteristic sequence to obtain a first characteristic sequence.
Further, the time sequence information includes an action segment start probability and an action segment end probability corresponding to each target video image, and the information detection module 30 includes:
the first detection unit is used for detecting whether a probability value larger than a first preset threshold value exists in the action segment starting probability to obtain a first detection result, and obtaining an action segment starting time array according to the first detection result;
the second detection unit is used for detecting whether a probability value larger than a second preset threshold value exists in the action segment ending probabilities to obtain a second detection result, and obtaining an action segment ending time array according to the second detection result;
and the information combination unit is used for combining the action segment start time array and the action segment end time array to obtain the time period information of the action area.
Further, the action region extraction device further includes:
the feature sampling module is used for sampling the features of each action area based on the first feature sequence and the time period information to obtain a second feature sequence;
the score evaluation module is used for inputting the second feature sequence into a preset action region evaluation model to obtain an action region evaluation score;
the region extraction module 40 is specifically configured to extract a corresponding action region from the video to be trimmed according to the action region evaluation score and the time period information.
Further, the feature sampling module includes:
the feature sampling unit is used for sampling the features of each action area by adopting a linear interpolation method based on the first feature sequence and the time period information to obtain a first preset number of feature data;
and the second splicing unit is used for splicing the first preset number of feature data to obtain a second feature sequence.
Further, the region extraction module 40 includes:
the score sorting unit is used for sorting the action region evaluation scores in descending order;
the first obtaining unit is used for obtaining a second preset number of action area evaluation scores according to the sorting result and taking them as target action area evaluation scores;
a second obtaining unit, configured to obtain, from the time period information, target time period information corresponding to the target action region evaluation score;
and the region extraction unit is used for extracting a corresponding action region from the video to be trimmed according to the target time period information.
The function implementation of each module in the action region extraction device corresponds to each step of the action region extraction method embodiments, and the functions and implementation processes are not described in detail here.
The present invention also provides a computer-readable storage medium having stored thereon an action region extraction program which, when executed by a processor, implements the steps of the action region extraction method as described in any one of the above embodiments.
The specific embodiment of the computer-readable storage medium of the present invention is substantially the same as the embodiments of the action region extraction method described above, and will not be described herein again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) as described above, and includes instructions for enabling a device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. An action region extraction method, characterized by comprising:
acquiring a video to be trimmed, and performing feature extraction on the video to be trimmed to obtain a first feature sequence;
inputting the first feature sequence into a preset time sequence evaluation model to obtain time sequence information;
detecting whether the time sequence information meets a preset condition or not, and obtaining time period information of an action area according to a detection result;
and extracting a corresponding action area from the video to be trimmed according to the time period information.
2. The method for extracting an action region according to claim 1, wherein the step of obtaining a video to be trimmed and performing feature extraction on the video to be trimmed to obtain a first feature sequence comprises:
acquiring a video to be trimmed, and performing framing processing on the video to be trimmed to obtain a video image sequence;
acquiring a target video image every other preset frame number in the video image sequence, and extracting red, green and blue (RGB) features and optical flow features of the target video image to obtain an RGB feature sequence and an optical flow feature sequence;
and splicing the RGB feature sequence and the optical flow feature sequence to obtain a first feature sequence.
3. The method as claimed in claim 2, wherein the time sequence information includes an action segment start probability and an action segment end probability corresponding to each target video image, and the step of detecting whether the time sequence information meets a preset condition and obtaining time period information of the action region according to a detection result includes:
detecting whether a probability value larger than a first preset threshold value exists in the action segment starting probability to obtain a first detection result, and obtaining an action segment starting time array according to the first detection result;
detecting whether a probability value larger than a second preset threshold value exists in the action segment ending probabilities to obtain a second detection result, and obtaining an action segment ending time array according to the second detection result;
and combining to obtain the time period information of the action area according to the action segment starting time array and the action segment ending time array.
4. The method as claimed in any one of claims 1 to 3, wherein before the step of extracting the corresponding action region from the video to be trimmed according to the time period information, the method further comprises:
sampling the features of each action area based on the first feature sequence and the time period information to obtain a second feature sequence;
inputting the second feature sequence into a preset action region evaluation model to obtain an action region evaluation score;
the step of extracting the corresponding action area from the video to be trimmed according to the time period information comprises the following steps:
and extracting a corresponding action area from the video to be trimmed according to the action area evaluation score and the time period information.
5. The action region extraction method according to claim 4, wherein the step of sampling the features of each action region based on the first feature sequence and the time period information to obtain a second feature sequence includes:
sampling the features of each action area by adopting a linear interpolation method based on the first feature sequence and the time period information to obtain a first preset number of feature data;
and splicing the first preset number of feature data to obtain a second feature sequence.
6. The method for extracting action regions according to claim 4, wherein the step of extracting corresponding action regions from the video to be trimmed according to the action region evaluation scores and the time period information comprises:
sorting the action region evaluation scores in descending order;
obtaining a second preset number of action area evaluation scores according to the sorting result, and taking the second preset number of action area evaluation scores as target action area evaluation scores;
acquiring target time period information corresponding to the target action area evaluation score from the time period information;
and extracting a corresponding action area from the video to be trimmed according to the target time period information.
7. An action region extraction device characterized by comprising:
the feature extraction module is used for acquiring a video to be trimmed and performing feature extraction on the video to be trimmed to obtain a first feature sequence;
the information acquisition module is used for inputting the first feature sequence into a preset time sequence evaluation model to obtain time sequence information;
the information detection module is used for detecting whether the time sequence information meets a preset condition or not and obtaining time period information of the action area according to a detection result;
and the region extraction module is used for extracting a corresponding action region from the video to be trimmed according to the time period information.
8. The action region extraction device according to claim 7, further comprising:
the feature sampling module is used for sampling the features of each action area based on the first feature sequence and the time period information to obtain a second feature sequence;
the score evaluation module is used for inputting the second feature sequence into a preset action region evaluation model to obtain an action region evaluation score;
the region extraction module is specifically configured to extract a corresponding action region from the video to be trimmed according to the action region evaluation score and the time period information.
9. An action region extraction device characterized by comprising a memory, a processor, and an action region extraction program stored on the memory and executable by the processor, wherein the action region extraction program when executed by the processor implements the steps of the action region extraction method according to any one of claims 1 to 6.
10. A computer-readable storage medium, characterized in that an action region extraction program is stored thereon, wherein the action region extraction program, when executed by a processor, implements the steps of the action region extraction method according to any one of claims 1 to 6.
CN202010185060.0A 2020-03-16 2020-03-16 Action region extraction method, device, equipment and computer readable storage medium Pending CN111368786A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010185060.0A CN111368786A (en) 2020-03-16 2020-03-16 Action region extraction method, device, equipment and computer readable storage medium
PCT/CN2020/136320 WO2021184852A1 (en) 2020-03-16 2020-12-15 Action region extraction method, device and apparatus, and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010185060.0A CN111368786A (en) 2020-03-16 2020-03-16 Action region extraction method, device, equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN111368786A 2020-07-03

Family

ID=71206848

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010185060.0A Pending CN111368786A (en) 2020-03-16 2020-03-16 Action region extraction method, device, equipment and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN111368786A (en)
WO (1) WO2021184852A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111898461A (en) * 2020-07-08 2020-11-06 贵州大学 Time sequence behavior segment generation method
CN112364835A (en) * 2020-12-09 2021-02-12 武汉轻工大学 Video information frame taking method, device, equipment and storage medium
WO2021184852A1 (en) * 2020-03-16 2021-09-23 平安科技(深圳)有限公司 Action region extraction method, device and apparatus, and computer-readable storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115412765B (en) * 2022-08-31 2024-03-26 北京奇艺世纪科技有限公司 Video highlight determination method and device, electronic equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108234821B (en) * 2017-03-07 2020-11-06 北京市商汤科技开发有限公司 Method, device and system for detecting motion in video
US11062128B2 (en) * 2017-06-02 2021-07-13 Canon Kabushiki Kaisha Interaction classification using the role of people interacting over time
CN109784269A (en) * 2019-01-11 2019-05-21 中国石油大学(华东) One kind is based on the united human action detection of space-time and localization method
CN110414367B (en) * 2019-07-04 2022-03-29 华中科技大学 Time sequence behavior detection method based on GAN and SSN
CN110796071B (en) * 2019-10-28 2021-02-19 广州云从博衍智能科技有限公司 Behavior detection method, system, machine-readable medium and device
CN111368786A (en) * 2020-03-16 2020-07-03 平安科技(深圳)有限公司 Action region extraction method, device, equipment and computer readable storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021184852A1 (en) * 2020-03-16 2021-09-23 平安科技(深圳)有限公司 Action region extraction method, device and apparatus, and computer-readable storage medium
CN111898461A (en) * 2020-07-08 2020-11-06 贵州大学 Time sequence behavior segment generation method
CN111898461B (en) * 2020-07-08 2022-08-30 贵州大学 Time sequence behavior segment generation method
CN112364835A (en) * 2020-12-09 2021-02-12 武汉轻工大学 Video information frame taking method, device, equipment and storage medium
CN112364835B (en) * 2020-12-09 2023-08-11 武汉轻工大学 Video information frame taking method, device, equipment and storage medium

Also Published As

Publication number Publication date
WO2021184852A1 (en) 2021-09-23

Similar Documents

Publication Publication Date Title
CN111368786A (en) Action region extraction method, device, equipment and computer readable storage medium
CN109151501B (en) Video key frame extraction method and device, terminal equipment and storage medium
CN112560999B (en) Target detection model training method and device, electronic equipment and storage medium
US11055516B2 (en) Behavior prediction method, behavior prediction system, and non-transitory recording medium
CN109766840B (en) Facial expression recognition method, device, terminal and storage medium
CN111814902A (en) Target detection model training method, target identification method, device and medium
CN108154086B (en) Image extraction method and device and electronic equipment
US20220172476A1 (en) Video similarity detection method, apparatus, and device
CN109740530B (en) Video segment extraction method, device, equipment and computer-readable storage medium
CN110263733B (en) Image processing method, nomination evaluation method and related device
CN109116129B (en) Terminal detection method, detection device, system and storage medium
US11620335B2 (en) Method for generating video synopsis through scene understanding and system therefor
CN111783712A (en) Video processing method, device, equipment and medium
CN110555428A (en) Pedestrian re-identification method, device, server and storage medium
CN110930984A (en) Voice processing method and device and electronic equipment
CN111899470A (en) Human body falling detection method, device, equipment and storage medium
CN114005019B (en) Method for identifying flip image and related equipment thereof
CN111476059A (en) Target detection method and device, computer equipment and storage medium
CN111539390A (en) Small target image identification method, equipment and system based on Yolov3
CN112712051A (en) Object tracking method and device, computer equipment and storage medium
CN115393755A (en) Visual target tracking method, device, equipment and storage medium
CN115719428A (en) Face image clustering method, device, equipment and medium based on classification model
CN114416786A (en) Stream data processing method and device, storage medium and computer equipment
CN114627556A (en) Motion detection method, motion detection device, electronic apparatus, and storage medium
CN111507289A (en) Video matching method, computer device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40032043

Country of ref document: HK

SE01 Entry into force of request for substantive examination