CN114170554A - Video detection method, video detection device, storage medium and electronic equipment


Info

Publication number
CN114170554A
Authority
CN
China
Prior art keywords
video
detected
feature
sequence
sample
Prior art date
Legal status
Pending
Application number
CN202111510590.9A
Other languages
Chinese (zh)
Inventor
张宸
陈忱
陶训强
何苗
郭彦东
Current Assignee
Shanghai Jinsheng Communication Technology Co ltd
Original Assignee
Shanghai Jinsheng Communication Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Jinsheng Communication Technology Co ltd
Priority to CN202111510590.9A
Publication of CN114170554A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/22: Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a video detection method, a video detection device, a computer-readable storage medium and an electronic device, and relates to the technical field of video processing. The video detection method comprises the following steps: acquiring a video to be detected and a reference video of a target action; extracting features from the image frames in the video to be detected to obtain a feature sequence to be detected corresponding to the video to be detected, and extracting features from the image frames in the reference video to obtain a reference feature sequence corresponding to the reference video; determining, in the feature sequence to be detected, a target sub-feature sequence matched with the reference feature sequence; and determining the image frames related to the target action in the video to be detected according to the target sub-feature sequence. The method and the device can conveniently and quickly detect the image frames related to the target action in the video to be detected.

Description

Video detection method, video detection device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of video processing technologies, and in particular, to a video detection method, a video detection apparatus, a computer-readable storage medium, and an electronic device.
Background
With the rapid development of computer technology, image and video data are growing rapidly. To meet the diverse demands of users in various video processing scenes, it is often necessary to detect actions in a video. For example, when a user wants to capture a sub-video of a jumping action from a long video, the long video may be subjected to video detection to identify the jumping action in it, and the sub-video corresponding to the jumping action may be captured, or the start time and end time corresponding to the jumping action may be determined.
In prior-art video detection methods, a deep neural network model is usually trained in advance on a large amount of labeled data, and the video to be detected is processed by the deep neural network model to identify the designated action in it. However, to ensure detection accuracy, this approach requires a large number of training videos to be labeled manually, at high time and labor cost. As video detection requirements keep increasing, the scale of the training video set is also easily limited, which affects the performance of the deep neural network model and degrades video detection efficiency and accuracy.
Disclosure of Invention
The present disclosure provides a video detection method, a video detection apparatus, a computer-readable storage medium, and an electronic device, so as to alleviate, at least to some extent, the problem that existing video detection methods need to label a large number of training videos at high time and labor cost.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to a first aspect of the present disclosure, there is provided a video detection method, including: acquiring a video to be detected and a reference video of a target action; extracting features from the image frames in the video to be detected to obtain a feature sequence to be detected corresponding to the video to be detected, and extracting features from the image frames in the reference video to obtain a reference feature sequence corresponding to the reference video; determining, in the feature sequence to be detected, a target sub-feature sequence matched with the reference feature sequence; and determining the image frames related to the target action in the video to be detected according to the target sub-feature sequence.
According to a second aspect of the present disclosure, there is provided a video detection apparatus comprising: a video acquisition module, configured to acquire a video to be detected and a reference video of a target action; a feature extraction module, configured to extract features from image frames in the video to be detected to obtain a feature sequence to be detected corresponding to the video to be detected, and extract features from the image frames in the reference video to obtain a reference feature sequence corresponding to the reference video; a sequence determination module, configured to determine, in the feature sequence to be detected, a target sub-feature sequence matched with the reference feature sequence; and an image determination module, configured to determine, according to the target sub-feature sequence, the image frames related to the target action in the video to be detected.
According to a third aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the video detection method of the first aspect described above and possible implementations thereof.
According to a fourth aspect of the present disclosure, there is provided an electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to execute the video detection method of the first aspect and its possible implementations by executing the executable instructions.
The technical scheme of the disclosure has the following beneficial effects:
acquiring a video to be detected and a reference video of a target action; extracting features from image frames in the video to be detected to obtain a feature sequence to be detected corresponding to the video to be detected, and extracting features from the image frames in the reference video to obtain a reference feature sequence corresponding to the reference video; determining, in the feature sequence to be detected, a target sub-feature sequence matched with the reference feature sequence; and determining the image frames related to the target action in the video to be detected according to the target sub-feature sequence. On one hand, the exemplary embodiment provides a new video detection method: by comparing the feature sequence to be detected with the reference feature sequence, the image frames related to the target action are determined in the video to be detected, and since the target sub-feature sequence is obtained by matching against the reference video of the target action, the matching process takes the reference video as a reference, so that the detection result has strong pertinence and accuracy. On the other hand, the matching only involves processing the reference video of the target action and does not require processing a large amount of other training videos, which avoids the time and labor cost of labeling a large number of training videos.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.
FIG. 1 shows a schematic diagram of a system architecture in the present exemplary embodiment;
fig. 2 is a block diagram showing an electronic apparatus in the present exemplary embodiment;
FIG. 3 shows a flow diagram of a video detection method in the present exemplary embodiment;
FIG. 4 shows a schematic image frame diagram of a video detection method in the present exemplary embodiment;
FIG. 5 illustrates a sub-flow diagram of a video detection method in the present exemplary embodiment;
FIG. 6 is a diagram illustrating training of a feature extraction model in the present exemplary embodiment;
FIG. 7 shows a flow diagram of another video detection method in the present exemplary embodiment;
FIG. 8 illustrates another sub-flow diagram of a video detection method in the exemplary embodiment;
fig. 9 shows a block diagram of a video detection apparatus in the present exemplary embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and the like. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
An exemplary embodiment of the present disclosure provides a video detection method. FIG. 1 shows a system architecture diagram of the environment in which the exemplary embodiment operates. As shown in fig. 1, the system architecture 100 may include a user terminal 110 and a server 120, which may communicate with each other through a network; for example, the user terminal 110 may send acquired video data to the server 120, and the server 120 may return a video detection result to the user terminal 110. The user terminal 110 may include, but is not limited to, an electronic device such as a smart phone, a tablet computer, a game machine, a wearable device, and the like; the server 120 refers to a background server providing internet services or video processing capability.
It should be understood that the number of devices in fig. 1 is merely exemplary. Any number of user terminals may be provided, or the server may be a cluster formed by a plurality of servers, according to implementation requirements.
The video detection method provided by the embodiment of the present disclosure may be executed by the user terminal 110, for example, after the user terminal 110 collects a video, a video detection process is directly performed; the video detection may also be performed by the server 120, for example, after the user terminal 110 collects a video, the video is uploaded to the server 120, so that the server 120 performs a video detection process, and returns a detection result to the user terminal 110, and the like, which is not limited in this disclosure.
An exemplary embodiment of the present disclosure provides an electronic device for implementing a video detection method, which may be the user terminal 110 or the server 120 in fig. 1. The electronic device comprises at least a processor and a memory for storing executable instructions of the processor, the processor being configured to perform the video detection method via execution of the executable instructions.
The structure of the electronic device is exemplarily described below by taking the mobile terminal 200 in fig. 2 as an example. It will be appreciated by those skilled in the art that the configuration of figure 2 can also be applied to fixed type devices, in addition to components specifically intended for mobile purposes.
As shown in fig. 2, the mobile terminal 200 may specifically include: a processor 210, an internal memory 221, an external memory interface 222, a USB (Universal Serial Bus) interface 230, a charging management module 240, a power management module 241, a battery 242, an antenna 1, an antenna 2, a mobile communication module 250, a wireless communication module 260, an audio module 270, a speaker 271, a receiver 272, a microphone 273, an earphone interface 274, a sensor module 280, a display screen 290, a camera module 291, an indicator 292, a motor 293, keys 294, and a SIM (Subscriber Identity Module) card interface 295.
Processor 210 may include one or more processing units, such as: the Processor 210 may include an AP (Application Processor), a modem Processor, a GPU (Graphics Processing Unit), an ISP (Image Signal Processor), a controller, an encoder, a decoder, a DSP (Digital Signal Processor), a baseband Processor, and/or an NPU (Neural-Network Processing Unit), etc.
The encoder may encode (i.e., compress) the image or video data, for example, encode a to-be-detected video or a reference video obtained after the video quality processing to form corresponding code stream data, so as to reduce the bandwidth occupied by data transmission; the decoder may decode (i.e., decompress) the code stream data of the image or video to restore the image or video data, for example, decode the video to be detected or the reference video to obtain image data of each frame in the video, perform feature extraction on one or more frames of the image, and so on.
In some embodiments, processor 210 may include one or more interfaces through which connections are made to other components of mobile terminal 200.
Internal memory 221 may be used to store computer-executable program code, which includes instructions. The internal memory 221 may include a volatile memory, a non-volatile memory, and the like. The processor 210 executes various functional applications of the mobile terminal 200 and data processing by executing instructions stored in the internal memory 221 and/or instructions stored in a memory provided in the processor.
The external memory interface 222 may be used to connect an external memory, such as a Micro SD card, for expanding the storage capability of the mobile terminal 200. The external memory communicates with the processor 210 through the external memory interface 222 to perform data storage functions, such as storing music, video, and other files.
The USB interface 230 is an interface conforming to the USB standard specification, and may be used to connect a charger to charge the mobile terminal 200, or connect an earphone or other electronic devices.
The charge management module 240 is configured to receive a charging input from a charger. While the charging management module 240 charges the battery 242, the power management module 241 may also supply power to the device; the power management module 241 may also monitor the status of the battery.
The wireless communication function of the mobile terminal 200 may be implemented by the antenna 1, the antenna 2, the mobile communication module 250, the wireless communication module 260, a modem processor, a baseband processor, and the like. The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. The mobile communication module 250 may provide a solution including 2G/3G/4G/5G wireless communication applied on the mobile terminal 200. The wireless communication module 260 may provide wireless communication solutions applied to the mobile terminal 200, including WLAN (Wireless Local Area Network, e.g. a Wi-Fi (Wireless Fidelity) network), BT (Bluetooth), GNSS (Global Navigation Satellite System), FM (Frequency Modulation), NFC (Near Field Communication), IR (Infrared), and the like.
The mobile terminal 200 may implement a display function through the GPU, the display screen 290, the AP, and the like, and display a user interface. The mobile terminal 200 may implement a shooting function through the ISP, the camera module 291, the encoder, the decoder, the GPU, the display screen 290, the AP, and the like, and may also implement an audio function through the audio module 270, the speaker 271, the receiver 272, the microphone 273, the earphone interface 274, the AP, and the like.
The sensor module 280 may include a depth sensor 2801, a pressure sensor 2802, a gyroscope sensor 2803, a barometric pressure sensor 2804, etc. to implement different sensing functions.
Indicator 292 may be an indicator light that may be used to indicate a state of charge, a change in charge, or may be used to indicate a message, missed call, notification, etc. The motor 293 may generate a vibration cue, may also be used for touch vibration feedback, and the like. The keys 294 include a power-on key, a volume key, and the like.
The mobile terminal 200 may support one or more SIM card interfaces 295 for connecting to a SIM card to implement functions such as telephony and data communications.
Fig. 3 shows an exemplary flow of a video detection method, which may be executed by the user terminal 110 or the server 120, and includes the following steps S310 to S340:
step S310, a video to be detected and a reference video of the target action are obtained.
The video to be detected is a video in which the target action needs to be detected. It may be a video shot by the user in real time or previously, a local video, or a video downloaded from the cloud or another video source, and so on. The target action is the action that needs to be detected, and depending on actual detection requirements it can be of various types: for example, when a running action needs to be detected in the video to be detected, the target action is the running action; when a jumping action needs to be detected in the video to be detected, the target action is the jumping action, and so on. The target action may be a single action, such as detecting a running action in the video to be detected; it may also be multiple actions, such as detecting a running action or a jumping action in the video to be detected; or it may be a combination of multiple actions, such as detecting, in the video to be detected, a combined action of first performing a running action and then performing a hurdling action, and the like.
The reference video is an example video that contains the target action and is used for comparison with the video to be detected, to detect whether the video to be detected contains the target action. In the exemplary embodiment, there may be a single reference video: for example, when only one target action, namely a running action, needs to be detected in the video to be detected, a video containing the running action may be acquired as the reference video. There may also be multiple reference videos: for example, when two target actions, namely a running action and a jumping action, need to be detected in the video to be detected, a video containing the running action and a video containing the jumping action can both be acquired and used as reference videos; or, to improve the accuracy of detecting a single target action such as a running action, a plurality of reference videos containing the running action may be obtained, for example videos of different people performing the running action, or videos of running actions in different scenes, and the like, which is not specifically limited in this disclosure.
Step S320, extracting features from the image frames in the video to be detected to obtain a to-be-detected feature sequence corresponding to the video to be detected, and extracting features from the image frames in the reference video to obtain a reference feature sequence corresponding to the reference video.
The exemplary embodiment may extract features from the image frames in the video to be detected, where a feature refers to data capable of reflecting image frame information and is used to represent the corresponding image frame; the feature may be an encoding of the image frame, for example a feature vector generated by encoding. Then, a feature sequence to be detected corresponding to the video to be detected is generated according to the temporal order of the image frames in the video to be detected; when the features of the image frames are feature vectors, the corresponding feature sequence to be detected is a feature vector sequence. The exemplary embodiment may input the video to be detected into a specific encoder or a pre-trained network model to encode the image frames, extract the features of the image frames, and then generate the feature sequence to be detected.
In the exemplary embodiment, when extracting features from the image frames in the video to be detected, features may be extracted from all image frames in the video to be detected; for example, the video to be detected is input into an encoder to encode all of its image frames frame by frame, generating a feature vector for each frame. Features can also be extracted from only part of the image frames in the video to be detected; for example, when detecting the target action in a video shot by a user, the target action may be assumed to appear in the middle or middle-to-rear section, so features can be extracted from the image frames of the video to be detected other than a preset number of frames at the head. The image frames subjected to feature extraction may also be determined according to the requirements of the user; for example, if the user chooses to detect a certain segment of the video to be detected, the image frames from which features are specifically extracted may be determined according to the actual situation, which is not specifically limited by the present disclosure.
In addition to extracting features from the image frames in the video to be detected to obtain the feature sequence to be detected, the exemplary embodiment also extracts features from the image frames in the obtained reference video to obtain the reference feature sequence corresponding to the reference video. The feature extraction manner for the image frames of the reference video and the generation manner of the reference feature sequence may be the same as those used for the video to be detected; that is, the reference video may also be input into a pre-trained network model to obtain the feature vectors of its image frames, and thereby the reference feature sequence corresponding to the reference video.
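As an illustration of this per-frame encoding, the following is a minimal Python sketch of collecting frame features into a feature sequence; the `encoder` callable and the frame representation are hypothetical stand-ins for whatever pre-trained encoder or network model is actually used, and frame decoding is assumed to have been done elsewhere.

```python
import numpy as np

def video_to_feature_sequence(frames, encoder):
    """Encode each image frame into a feature vector and stack the vectors
    in temporal order to form the feature sequence of the video.

    frames  : iterable of decoded image frames (e.g. HxWx3 numpy arrays)
    encoder : callable mapping one frame to a 1-D feature vector; a stand-in
              for the pre-trained feature extraction model.
    """
    features = [np.asarray(encoder(frame), dtype=np.float32) for frame in frames]
    return np.stack(features, axis=0)  # shape: (num_frames, feature_dim)

# Hypothetical usage:
# seq_to_detect = video_to_feature_sequence(frames_to_detect, encoder)
# reference_seq = video_to_feature_sequence(reference_frames, encoder)
```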
Step S330, determining, in the feature sequence to be detected, a target sub-feature sequence matched with the reference feature sequence.
The feature sequence to be detected reflects the video to be detected, and the reference feature sequence reflects the reference video containing the target action. Searching the feature sequence to be detected for a target sub-feature sequence that matches the reference feature sequence can therefore be regarded as the process of determining, in the video to be detected, the sub-video related to the target action; the target sub-feature sequence is a sub-sequence formed by the features of the image frames in the feature sequence to be detected that may be related to the target action.
In this exemplary embodiment, the target sub-feature sequence may be determined in multiple ways. Specifically, the feature of each image frame in the reference feature sequence may be used as a reference, the feature sequence to be detected may be searched to determine the feature to be detected corresponding to each reference feature, and the target sub-feature sequence may then be determined according to the search result. Alternatively, the continuous distribution pattern of the reference feature sequence may be used as a reference, and sub-feature sequences that are the same as or similar to that pattern may be found by comparison in the feature sequence to be detected. For example, the exemplary embodiment may determine the size of a sliding window according to the length of the reference feature sequence, then slide the window over the feature sequence to be detected, determine the matching degree between the feature sequence in the sliding window and the reference feature sequence, and thereby determine the target sub-feature sequence that matches the reference feature sequence. The matching degree may be calculated in various ways, for example using the Euclidean distance or the cosine similarity between the feature sequence in the sliding window and the reference feature sequence. When no sub-feature sequence matching the reference feature sequence is found in the feature sequence to be detected, this indicates that no sub-video related to the target action exists in the video to be detected.
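As a hedged sketch of one of the matching-degree calculations mentioned above (average frame-wise cosine similarity; Euclidean distance would work analogously), assuming both sequences are numpy arrays of shape (num_frames, feature_dim) and equal length:

```python
import numpy as np

def matching_degree(window_seq, ref_seq):
    """Average frame-wise cosine similarity between the feature sequence in
    the sliding window and the reference feature sequence."""
    a = window_seq / (np.linalg.norm(window_seq, axis=1, keepdims=True) + 1e-8)
    b = ref_seq / (np.linalg.norm(ref_seq, axis=1, keepdims=True) + 1e-8)
    return float(np.mean(np.sum(a * b, axis=1)))  # in [-1, 1], higher is better
```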
Step S340, determining the image frames related to the target action in the video to be detected according to the target sub-feature sequence.
The target sub-feature sequence is the sequence formed by the features, in the feature sequence to be detected, of the image frames related to the target action. Based on each feature in the target sub-feature sequence, the image frames included in the sub-video related to the target action can be determined from the video to be detected; for example, the image frame corresponding to each feature vector in the target sub-feature sequence can be determined from the video to be detected. According to actual needs, the present exemplary embodiment may determine all image frames related to the target action from the video to be detected; for example, when the target action is a jumping action, all image frames included in the corresponding sub-video, from the beginning of the jumping action to its end, are determined from the video to be detected. Alternatively, only part of the image frames related to the target action may be determined; for example, when the target action is a jumping action, the image frame corresponding to the start of the jumping action and the image frame corresponding to its end may be determined from the video to be detected. When the target action includes multiple actions, the present exemplary embodiment may output the image frames related to each of them and, for convenience of management, classify the target actions and identify the image frames related to each type of target action; for example, it may output 10 image frames related to a jumping action at the 16th to 25th frames of the video to be detected and 15 image frames related to a running action at the 31st to 45th frames of the video to be detected, and so on.
The exemplary embodiment detects the target action in the video to be detected by comparing the feature sequence to be detected with the reference feature sequence, and can still maintain high detection accuracy and efficiency even when the person performing the target action in the video to be detected is partially occluded.
In an exemplary embodiment, the step S340 may include:
and determining the starting frame and the ending frame of the target action in the video to be detected according to the position of the target sub-feature sequence in the feature sequence to be detected.
In practical applications, there are often scenarios in which a video needs to be detected to determine the boundary of a target action. This exemplary embodiment may determine the start frame and the end frame of the target action in the video to be detected according to the position of the target sub-feature sequence in the feature sequence to be detected, and use the start frame and the end frame to identify the motion boundary of the target action. For example, if the target sub-feature sequence of a running action occupies the 16th to 25th frames of the feature sequence to be detected, the 16th and 25th frames can be determined from the video to be detected and used as the start frame and the end frame of the running action respectively, so that the motion boundary of the running action in the video to be detected is represented by two key frames among the multiple image frames of the running action, and so on.
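A minimal sketch of mapping the position of the matched target sub-feature sequence back to the start and end frames, assuming one feature per image frame in temporal order (the indices used here are illustrative assumptions, not fixed by the disclosure):

```python
def action_boundary(match_start, match_length):
    """Return (start_frame, end_frame) of the target action in the video to
    be detected, given where the target sub-feature sequence begins in the
    feature sequence to be detected and how many features it spans."""
    start_frame = match_start
    end_frame = match_start + match_length - 1
    return start_frame, end_frame

# e.g. a match covering features 15..24 (0-based) corresponds to the
# 16th to 25th image frames of the video to be detected.
```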
Fig. 4 is a schematic diagram illustrating the video detection method in the present exemplary embodiment. As shown in fig. 4, taking a jumping action as the target action as an example, first, a video 410 to be detected and a reference video 420 containing the jumping action may be obtained; then features are extracted from the image frames in the video 410 to be detected to obtain a feature sequence 430 to be detected, and features are extracted from the image frames in the reference video to obtain a reference feature sequence 440 corresponding to the reference video; a target sub-feature sequence matched with the reference feature sequence 440 is determined in the feature sequence 430 to be detected; finally, all image frames 450 related to the target action in the video 410 to be detected are determined according to the target sub-feature sequence, or the start frame and the end frame 460 of the target action are determined in the video 410 to be detected according to the position of the target sub-feature sequence in the feature sequence 430 to be detected.
To sum up, in the present exemplary embodiment, a video to be detected and a reference video of a target action are obtained; features are extracted from the image frames in the video to be detected to obtain a feature sequence to be detected corresponding to the video to be detected, and features are extracted from the image frames in the reference video to obtain a reference feature sequence corresponding to the reference video; a target sub-feature sequence matched with the reference feature sequence is determined in the feature sequence to be detected; and the image frames related to the target action are determined in the video to be detected according to the target sub-feature sequence. On one hand, the exemplary embodiment provides a new video detection method: by comparing the feature sequence to be detected with the reference feature sequence, the image frames related to the target action are determined in the video to be detected, and since the target sub-feature sequence is obtained by matching against the reference video of the target action, the matching process takes the reference video as a reference, so that the detection result has strong pertinence and accuracy. On the other hand, the matching only involves processing the reference video of the target action and does not require processing a large amount of other training videos, which avoids the time and labor cost of labeling a large number of training videos.
In an exemplary embodiment, in the step S320, extracting features from image frames in the video to be detected to obtain a feature sequence to be detected corresponding to the video to be detected, includes:
and extracting features of the image frames in the video to be detected by using a pre-trained feature extraction model to obtain a feature sequence to be detected corresponding to the video to be detected.
The exemplary embodiment may train a feature extraction model in advance and encode the video to be detected with it to extract the image features of the image frames, so as to obtain the feature sequence to be detected corresponding to the video to be detected. The feature extraction model can be a self-supervised deep neural network model trained with a temporal cycle consistency algorithm; it can learn the encoding from paired sample videos used for training, without explicit labels having to be provided to the model.
In addition, the feature extraction model may also be another neural network model having an image feature extraction function, for example, when an image to be classified is input into the image classification model, image features of the image to be classified are often extracted through an intermediate layer to generate a feature image, and then the feature image is classified and identified to obtain an image classification result. Therefore, the present exemplary embodiment may also use the image classification model as the feature extraction model, process the image frames in the video to be detected by using the image classification model, and obtain only the feature images output by the intermediate layer, so as to implement the process of extracting features of the image frames in the video to be detected, and so on.
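As a hedged illustration of reusing an image classification model's intermediate layers as the feature extractor, the sketch below builds a frame encoder from a torchvision ResNet by dropping its final classification layer; the choice of backbone and the recent torchvision weights API are assumptions, not part of the disclosure.

```python
import torch
import torchvision.models as models

# Keep everything before the final fully-connected classifier so that the
# intermediate-layer output (after global average pooling) is the frame feature.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-1]).eval()

@torch.no_grad()
def encode_frame(frame_tensor):
    """frame_tensor: a preprocessed image tensor of shape (3, H, W)."""
    feat = feature_extractor(frame_tensor.unsqueeze(0))  # shape (1, 512, 1, 1)
    return feat.flatten()                                # shape (512,)
```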
Specifically, in an exemplary embodiment, as shown in fig. 5, the video detection method may further include the following steps:
step S510, obtaining a sample video pair, wherein the sample video pair comprises a first sample video and a second sample video, and the first sample video and the second sample video correspond to the same action;
step S520, respectively extracting features from image frames in the first sample video and the second sample video by using a feature extraction model to be trained to obtain a first sample feature sequence corresponding to the first sample video and a second sample feature sequence corresponding to the second sample video;
step S530, for at least one frame of first sample features in the first sample feature sequence, determining the second sample features most similar to the first sample features in the second sample feature sequence to obtain a first matching result, and determining the first sample features most similar to the second sample features in the first sample feature sequence to obtain a second matching result;
step S540, updating parameters of the feature extraction model according to a difference between the first matching result and the second matching result.
The sample video pair refers to training data for training the feature extraction model, and may include a first sample video and a second sample video of the same action, for example a first sample video and a second sample video of a running action performed in different scenes, or performed by different people, and the like. When training the feature extraction model, the sample video pair may be used as training data and input into the feature extraction model to be trained, for example batch by batch or epoch by epoch. Then, features are extracted from the image frames in the first sample video and the second sample video respectively by the feature extraction model to be trained, for example the feature vectors e1 of the image frames in the first sample video and the feature vectors e2 of the image frames in the second sample video are extracted respectively. Based on the features extracted from the image frames in the first and second sample videos, a first sample feature sequence corresponding to the first sample video and a second sample feature sequence corresponding to the second sample video can then be generated respectively.
Further, for at least one frame of first sample features in the first sample feature sequence, the second sample feature most similar to the first sample feature is determined in the second sample feature sequence to obtain a first matching result, and then the first sample feature most similar to that second sample feature is determined in the first sample feature sequence to obtain a second matching result. That is, an image frame in the second sample video that is similar to an image frame in the first sample video is found first, and then the image frame in the first sample video most similar to it is found, forming a cycle. The parameters of the feature extraction model are updated according to the difference between the first matching result and the second matching result, that is, the difference between the image frame found in the second sample video and the image frame found back in the first sample video. Specifically, the parameters of the feature extraction model are adjusted so that the difference between the first matching result and the second matching result becomes smaller and smaller, until the accuracy of the feature extraction model reaches a certain standard or a convergence condition, and the feature extraction model with finally updated parameters is obtained. Based on this, the training process of the feature extraction model can be realized. Fig. 6 shows a training diagram of the feature extraction model, which may specifically include: obtaining a sample video pair, the sample video pair comprising a first sample video 610 and a second sample video 620, the first sample video 610 and the second sample video 620 corresponding to the same action; extracting features from the image frames in the first sample video 610 and the second sample video 620 respectively through a feature extraction model 630 to be trained; and constructing a temporal cycle alignment loss function 640 based on a temporal cycle consistency algorithm, and updating the parameters of the feature extraction model to obtain the trained feature extraction model.
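The cycle described above can be illustrated with the following hedged sketch; in actual training the hard nearest-neighbour selection would be replaced by a soft, differentiable version so that the temporal cycle alignment loss can update the model, but the hard version shows the matching logic.

```python
import numpy as np

def cycle_back_index(seq1, seq2, i):
    """For the i-th feature of the first sample feature sequence, find its most
    similar feature in the second sample sequence (first matching result), then
    find the feature in the first sequence most similar to that one (second
    matching result) and return its index. Training drives this index toward i."""
    j = int(np.argmin(np.linalg.norm(seq2 - seq1[i], axis=1)))  # into second sequence
    k = int(np.argmin(np.linalg.norm(seq1 - seq2[j], axis=1)))  # back into first sequence
    return k
```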
Fig. 7 shows a flowchart of another video detection method, which may specifically include the following steps:
step S710, acquiring a video to be detected;
step S720, acquiring a reference video containing a target action;
step 730, extracting features of image frames in a video to be detected by using a pre-trained feature extraction model to obtain a feature sequence to be detected corresponding to the video to be detected;
step S740, extracting features from image frames in a reference video by using a pre-trained feature extraction model to obtain a reference feature sequence corresponding to the reference video;
Step S750, judging whether a target sub-feature sequence matched with the reference feature sequence exists in the feature sequence to be detected;
if such a target sub-feature sequence exists in the feature sequence to be detected, executing
Step S760, determining the category of the target action, and the start frame and end frame of the target action in the video to be detected, according to the position of the target sub-feature sequence in the feature sequence to be detected;
if no such target sub-feature sequence exists in the feature sequence to be detected, executing
Step S770, returning a detection result indicating that no target sub-feature sequence matched with the reference feature sequence exists in the feature sequence to be detected.
In an exemplary embodiment, as shown in fig. 8, the step S330 may include:
step S810, determining the size of a sliding window according to the length of the reference characteristic sequence;
and S820, extracting a sub-feature sequence from the feature sequence to be detected by using a sliding window, determining the matching degree of the sub-feature sequence and the reference feature sequence, and determining the sub-feature sequence as a target sub-feature sequence when the matching degree reaches a preset threshold value.
In the exemplary embodiment, the sliding window may be set to slide in the video to be detected, so as to implement the matching process of the feature sequence. Specifically, the size of the sliding window may be determined according to the length of the reference feature sequence, for example, the reference feature sequence is a sequence formed by features of 10 frames of images, and the size of the sliding window may be set to 10 frames, or may be set to less than 10 frames, such as 9 frames or 8 frames. Then, a sliding window can be adopted to slide in the characteristic sequence to be detected, each step of sliding can determine a segment of sub-characteristic sequence corresponding to the current sliding window, and further, the target sub-characteristic sequence can be determined by calculating the matching degree of the sub-characteristic sequence and the reference characteristic sequence. The step length of the sliding window can be set by self according to the speed requirement and the accuracy requirement of the equalizing system, and the method is not particularly limited in the present disclosure. The matching degree between the reference feature sequence and the sub-feature sequence can be realized by calculating the similarity, for example, calculating the similarity between the feature vector in the reference feature sequence and the feature vector in the sub-feature sequence by means of cosine similarity or Euclidean distance, and when the matching degree reaches a preset threshold, for example, the Euclidean distance is smaller than a preset threshold, the current matching degree is considered to meet a certain requirement, and the sub-feature sequence can be determined as the target sub-feature sequence.
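A minimal sketch of this sliding-window search, using the mean frame-wise Euclidean distance as the matching degree (smaller is better, so the threshold is an upper bound); the step length, threshold and tie-breaking rule are assumptions for illustration only:

```python
import numpy as np

def find_target_subsequence(seq_to_detect, ref_seq, dist_threshold, step=1):
    """Slide a window of the reference sequence's length over the feature
    sequence to be detected and return (start, end, distance) of the best
    window whose mean frame-wise Euclidean distance is below the threshold,
    or None if no sub-feature sequence matches."""
    win = len(ref_seq)
    best = None
    for start in range(0, len(seq_to_detect) - win + 1, step):
        window_seq = seq_to_detect[start:start + win]
        dist = float(np.mean(np.linalg.norm(window_seq - ref_seq, axis=1)))
        if dist < dist_threshold and (best is None or dist < best[2]):
            best = (start, start + win - 1, dist)
    return best
```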
In an exemplary embodiment, in the step S820, extracting the sub-feature sequence from the feature sequence to be detected by using a sliding window may include the following steps:
determining at least one reference feature frame in the reference feature sequence;
determining a characteristic frame to be detected which is most similar to the reference characteristic frame in the characteristic sequence to be detected, and determining the initial position of the sliding window in the characteristic sequence to be detected according to the position of the characteristic frame to be detected;
the sliding window is placed at an initial position and the sub-feature sequences located within the sliding window are extracted.
In the exemplary embodiment, in order to improve the efficiency and accuracy of sliding the window over the feature sequence to be detected, the initial position of the sliding window may be set specifically. First, at least one reference feature frame may be determined in the reference feature sequence; the reference feature frame may be, for example, the first or last image frame corresponding to the reference feature sequence. Then, the feature frame to be detected that is most similar to the reference feature frame is found in the feature sequence to be detected. For example, suppose the feature sequence to be detected is generated from a video containing a jumping action and the reference feature sequence is generated from a reference video containing the jumping action; the reference feature frame can be set to the image frame at which the person's feet have just left the ground at the start of the jump. Since the feature sequence to be detected includes, in addition to the video segment of the jumping action, other video segments unrelated to the jumping action, sliding the window from the head of the feature sequence to be detected could produce a large number of invalid matching calculations. To improve detection efficiency, the feature frame to be detected that is most similar to the reference feature frame, e.g. the image frame in which the person's feet have just left the ground, is found in the feature sequence to be detected and used as the feature frame to be detected. The specific search can be performed by calculating the similarity between feature frames, for example the cosine similarity or Euclidean distance between the reference feature frame and the image frames in the feature sequence to be detected, so as to determine the most similar feature frame to be detected. Further, the initial position of the sliding window in the feature sequence to be detected can be determined according to the position of the feature frame to be detected; for example, the position of the feature frame to be detected can be taken as the initial position of the sliding window, or it can be taken as the end position of the sliding window and the initial position determined by pushing back by a preset step length. Finally, the sliding window is moved step by step from the determined initial position, and the sub-feature sequence within the window is extracted after each move. For example, the first frame in the reference feature sequence is taken as the reference feature frame, the feature frame to be detected most similar to it is found in the feature sequence to be detected, the position of that feature frame is taken as the initial position of the sliding window, the window is then moved by a preset step length, and the sub-feature sequence in the current window is extracted after each step.
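A hedged sketch of choosing the initial sliding-window position from the feature frame to be detected that is most similar to a chosen reference feature frame; using the first frame of the reference feature sequence and Euclidean distance here is an assumption for illustration:

```python
import numpy as np

def initial_window_position(seq_to_detect, ref_seq, ref_frame_idx=0):
    """Return the index of the feature frame in the sequence to be detected
    that is most similar (smallest Euclidean distance) to the chosen reference
    feature frame; the sliding window can start from this position instead of
    from the head of the sequence."""
    ref_frame = ref_seq[ref_frame_idx]
    dists = np.linalg.norm(seq_to_detect - ref_frame, axis=1)
    return int(np.argmin(dists))
```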
In an exemplary embodiment, there may be a plurality of reference videos of the target action. When a plurality of reference videos are used, their lengths may be inconsistent. Based on this, the present exemplary embodiment may determine the size of the sliding window in various ways; for example, the average length of the different reference videos may be calculated and the size of the sliding window determined from that average. The size of the sliding window may also be determined by comparing the plurality of reference feature sequences corresponding to the plurality of reference videos. Specifically, the above step S810 may include:
comparing a plurality of reference feature sequences corresponding to a plurality of reference videos to determine outlier feature frames in the reference feature sequences;
and removing the outlier characteristic frame from the reference characteristic sequence, and determining the size of the sliding window according to the length of the reference characteristic sequence after the outlier characteristic frame is removed.
In order to ensure the accuracy and effectiveness of the reference feature sequences, when the lengths of the plurality of reference videos are inconsistent, the plurality of reference feature sequences corresponding to them can be compared, and the outlier feature frames in the reference feature sequences are determined first. Then, the outlier feature frames are removed from the reference feature sequences, and the size of the sliding window is determined based on the length of the reference feature sequence after the outlier feature frames are removed, so that the effectiveness of the remaining feature frames is ensured and the accuracy of feature sequence matching is further guaranteed.
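The disclosure does not fix the exact outlier criterion, so the following is only a hedged heuristic sketch: each feature frame of a reference feature sequence is scored by its mean nearest-neighbour distance to the other reference sequences, and frames with anomalously large scores are removed before the window size is taken from the remaining length. It assumes at least two reference sequences stored as numpy arrays.

```python
import numpy as np

def remove_outlier_frames(ref_seqs, z_thresh=2.0):
    """ref_seqs: list of arrays of shape (num_frames_i, feature_dim).
    Returns the sequences with outlier feature frames removed."""
    cleaned = []
    for i, seq in enumerate(ref_seqs):
        others = [s for j, s in enumerate(ref_seqs) if j != i]
        scores = np.array([
            np.mean([np.min(np.linalg.norm(o - f, axis=1)) for o in others])
            for f in seq
        ])
        keep = scores <= scores.mean() + z_thresh * scores.std()
        cleaned.append(seq[keep])
    return cleaned

# The sliding-window size can then be taken from the cleaned reference length,
# e.g. window = len(cleaned[0]).
```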
It should be noted that the size of the sliding window may be flexibly set according to actual requirements or reference videos, for example, the reference videos of different types of target actions may be set to have different sizes of the sliding window, which is not specifically limited in this disclosure.
Exemplary embodiments of the present disclosure also provide a video detection apparatus. As shown in fig. 9, the video detection apparatus 900 may include: the video acquiring module 910 is configured to acquire a video to be detected and a reference video of a target action; the feature extraction module 920 is configured to extract features from image frames in a video to be detected to obtain a feature sequence to be detected corresponding to the video to be detected, and extract features from image frames in a reference video to obtain a reference feature sequence corresponding to the reference video; a sequence determining module 930, configured to determine a target sub-feature sequence matching the reference feature sequence in the feature sequences to be detected; and an image determining module 940, configured to determine, according to the target sub-feature sequence, an image frame related to the target action in the video to be detected.
In an exemplary embodiment, the feature extraction module includes: and the feature extraction unit is used for extracting features from the image frames in the video to be detected by using the pre-trained feature extraction model to obtain a feature sequence to be detected corresponding to the video to be detected.
In an exemplary embodiment, the video detection apparatus further includes: a sample video acquisition module, configured to acquire a sample video pair, the sample video pair comprising a first sample video and a second sample video, the first sample video and the second sample video corresponding to the same action; a sample feature extraction module, configured to extract features from the image frames in the first sample video and the second sample video respectively by using a feature extraction model to be trained, to obtain a first sample feature sequence corresponding to the first sample video and a second sample feature sequence corresponding to the second sample video; a matching result obtaining module, configured to, for at least one frame of first sample features in the first sample feature sequence, determine the second sample feature most similar to the first sample feature in the second sample feature sequence to obtain a first matching result, and determine the first sample feature most similar to that second sample feature in the first sample feature sequence to obtain a second matching result; and a model parameter updating module, configured to update the parameters of the feature extraction model according to the difference between the first matching result and the second matching result.
In an exemplary embodiment, the sequence determination module includes: a length determination unit for determining the size of the sliding window according to the length of the reference feature sequence; and the target sub-feature sequence determining unit is used for extracting a sub-feature sequence from the feature sequence to be detected by using a sliding window, determining the matching degree of the sub-feature sequence and the reference feature sequence, and determining the sub-feature sequence as the target sub-feature sequence when the matching degree reaches a preset threshold value.
In an exemplary embodiment, the target sub-feature sequence determination unit includes: a reference feature frame determining subunit, configured to determine at least one reference feature frame in the reference feature sequence; the initial position determining subunit is used for determining the characteristic frame to be detected which is most similar to the reference characteristic frame in the characteristic sequence to be detected, and determining the initial position of the sliding window in the characteristic sequence to be detected according to the position of the characteristic frame to be detected; and the sub-feature sequence extraction subunit is used for placing the sliding window at the initial position and extracting the sub-feature sequences positioned in the sliding window.
In an exemplary embodiment, the reference video of the target action includes a plurality of reference videos; the target sub-feature sequence determination unit includes: the outlier characteristic frame determining subunit is used for comparing a plurality of reference characteristic sequences corresponding to a plurality of reference videos to determine an outlier characteristic frame in the reference characteristic sequences; and the outlier feature frame removing subunit is used for removing the outlier feature frame from the reference feature sequence and determining the size of the sliding window according to the length of the reference feature sequence from which the outlier feature frame is removed.
In an exemplary embodiment, the image determination module includes: and the image frame determining unit is used for determining a starting frame and an ending frame of the target action in the video to be detected according to the position of the target sub-feature sequence in the feature sequence to be detected.
The specific details of each part in the above device have been described in detail in the method part embodiments, and thus are not described again.
Exemplary embodiments of the present disclosure also provide a computer-readable storage medium, which may be implemented in the form of a program product including program code. When the program product is run on a terminal device, the program code causes the terminal device to perform the steps according to the various exemplary embodiments of the present disclosure described in the above "exemplary method" section of this specification; for example, any one or more of the steps in fig. 3, fig. 5, fig. 7 or fig. 8 may be performed. The program product may employ a portable compact disc read-only memory (CD-ROM), include program code, and be run on a terminal device such as a personal computer. However, the program product of the present disclosure is not limited thereto; in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory, a Read Only Memory (ROM), an erasable programmable read only memory (EPROM or flash memory), an optical fiber, a portable compact disc read only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, all of which may generally be referred to herein as a "circuit," "module" or "system." Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the following claims.

Claims (10)

1. A video detection method, comprising:
acquiring a video to be detected and a reference video of a target action;
extracting features from the image frames in the video to be detected to obtain a feature sequence to be detected corresponding to the video to be detected, and extracting features from the image frames in the reference video to obtain a reference feature sequence corresponding to the reference video;
determining a target sub-feature sequence matched with the reference feature sequence in the feature sequence to be detected;
and determining the image frames related to the target action in the video to be detected according to the target sub-feature sequence.
2. The method according to claim 1, wherein the extracting features from the image frames in the video to be detected to obtain a feature sequence to be detected corresponding to the video to be detected comprises:
and extracting features of the image frames in the video to be detected by using a pre-trained feature extraction model to obtain a feature sequence to be detected corresponding to the video to be detected.
3. The method of claim 2, further comprising:
obtaining a sample video pair, the sample video pair comprising a first sample video and a second sample video, the first sample video and the second sample video corresponding to a same action;
respectively extracting features from image frames in the first sample video and the second sample video by using the feature extraction model to be trained to obtain a first sample feature sequence corresponding to the first sample video and a second sample feature sequence corresponding to the second sample video;
for at least one frame of first sample features in the first sample feature sequence, determining a second sample feature which is most similar to the first sample features in the second sample feature sequence to obtain a first matching result, and determining a first sample feature which is most similar to the second sample features in the first sample feature sequence to obtain a second matching result;
and updating the parameters of the feature extraction model according to the difference between the first matching result and the second matching result.
4. The method according to claim 1, wherein the determining of the target sub-feature sequence matched with the reference feature sequence in the feature sequence to be detected comprises:
determining the size of a sliding window according to the length of the reference characteristic sequence;
and extracting a sub-feature sequence from the feature sequence to be detected by using the sliding window, determining the matching degree of the sub-feature sequence and the reference feature sequence, and determining the sub-feature sequence as the target sub-feature sequence when the matching degree reaches a preset threshold value.
5. The method according to claim 4, wherein the extracting sub-feature sequences from the feature sequences to be detected by using the sliding window comprises:
determining at least one reference feature frame in the reference feature sequence;
determining a feature frame to be detected which is most similar to the reference feature frame in the feature sequence to be detected, and determining an initial position of the sliding window in the feature sequence to be detected according to the position of the feature frame to be detected;
and placing the sliding window at the initial position and extracting the sub-feature sequences positioned in the sliding window.
6. The method of claim 4, wherein the reference video of the target action comprises a plurality of reference videos; the determining the size of the sliding window according to the length of the reference feature sequence includes:
comparing a plurality of reference feature sequences corresponding to the plurality of reference videos to determine outlier feature frames in the reference feature sequences;
and removing the outlier characteristic frame from the reference characteristic sequence, and determining the size of the sliding window according to the length of the reference characteristic sequence after the outlier characteristic frame is removed.
7. The method according to claim 1, wherein the determining image frames related to the target action in the video to be detected according to the target sub-feature sequence comprises:
and determining a starting frame and an ending frame of the target action in the video to be detected according to the position of the target sub-feature sequence in the feature sequence to be detected.
8. A video detection apparatus, comprising:
the video acquisition module is used for acquiring a video to be detected and a reference video of a target action;
the feature extraction module is used for extracting features from image frames in the video to be detected to obtain a feature sequence to be detected corresponding to the video to be detected, and extracting features from the image frames in the reference video to obtain a reference feature sequence corresponding to the reference video;
the sequence determination module is used for determining a target sub-feature sequence matched with the reference feature sequence in the feature sequence to be detected;
and the image determining module is used for determining the image frames related to the target action in the video to be detected according to the target sub-feature sequence.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1 to 7.
10. An electronic device, comprising:
a processor;
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any of claims 1 to 7 via execution of the executable instructions.
CN202111510590.9A 2021-12-10 2021-12-10 Video detection method, video detection device, storage medium and electronic equipment Pending CN114170554A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111510590.9A CN114170554A (en) 2021-12-10 2021-12-10 Video detection method, video detection device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111510590.9A CN114170554A (en) 2021-12-10 2021-12-10 Video detection method, video detection device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN114170554A (en) 2022-03-11

Family

ID=80485524

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111510590.9A Pending CN114170554A (en) 2021-12-10 2021-12-10 Video detection method, video detection device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN114170554A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115620210A (en) * 2022-11-29 2023-01-17 广东祥利科技有限公司 Method and system for determining performance of electronic wire based on image processing


Similar Documents

Publication Publication Date Title
CN109086709B (en) Feature extraction model training method and device and storage medium
CN111429517A (en) Relocation method, relocation device, storage medium and electronic device
CN111694978B (en) Image similarity detection method and device, storage medium and electronic equipment
CN111476783B (en) Image processing method, device and equipment based on artificial intelligence and storage medium
CN112270710B (en) Pose determining method, pose determining device, storage medium and electronic equipment
EP3893125A1 (en) Method and apparatus for searching video segment, device, medium and computer program product
CN111598776A (en) Image processing method, image processing apparatus, storage medium, and electronic device
CN105654039A (en) Image processing method and device
CN111784614A (en) Image denoising method and device, storage medium and electronic equipment
CN112258381A (en) Model training method, image processing method, device, equipment and storage medium
CN112749350B (en) Information processing method and device of recommended object, storage medium and electronic equipment
CN112288816B (en) Pose optimization method, pose optimization device, storage medium and electronic equipment
CN112289279B (en) Screen brightness adjusting method and device, storage medium and electronic equipment
CN113395542A (en) Video generation method and device based on artificial intelligence, computer equipment and medium
CN114239717A (en) Model training method, image processing method and device, electronic device and medium
CN111368127A (en) Image processing method, image processing device, computer equipment and storage medium
CN111432245B (en) Multimedia information playing control method, device, equipment and storage medium
WO2022193911A1 (en) Instruction information acquisition method and apparatus, readable storage medium, and electronic device
CN111343356A (en) Image processing method, image processing apparatus, storage medium, and electronic device
CN114170554A (en) Video detection method, video detection device, storage medium and electronic equipment
CN114494942A (en) Video classification method and device, storage medium and electronic equipment
CN113343895A (en) Target detection method, target detection device, storage medium, and electronic apparatus
CN112256890A (en) Information display method and device, electronic equipment and storage medium
CN114973293A (en) Similarity judgment method, key frame extraction method, device, medium and equipment
CN112115740B (en) Method and apparatus for processing image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination