CN109977262B - Method and device for acquiring candidate segments from video and processing equipment - Google Patents


Info

Publication number
CN109977262B
Authority
CN
China
Prior art keywords
video
similarity
candidate
segment
candidate segments
Prior art date
Legal status
Active
Application number
CN201910231596.9A
Other languages
Chinese (zh)
Other versions
CN109977262A (en)
Inventor
卢江虎
姚聪
刘小龙
孙宇超
Current Assignee
Beijing Kuangshi Technology Co Ltd
Original Assignee
Beijing Kuangshi Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Kuangshi Technology Co Ltd
Priority to CN201910231596.9A
Publication of CN109977262A
Application granted
Publication of CN109977262B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content

Abstract

The invention provides a method, an apparatus, and a processing device for acquiring candidate segments from a video, relating to the technical field of motion detection. The method comprises the following steps: acquiring a video to be detected; calculating, with a preset similarity algorithm, the image similarity between each pair of adjacent video frames of the video to be detected, to obtain a similarity sequence in which the image similarities are ordered the same way as the video frames; taking the image similarities in the similarity sequence that are greater than a first segmentation threshold as target image similarities; and, if the target image similarities are arranged consecutively in the similarity sequence, taking the video frames corresponding to those target image similarities as a candidate segment of the video to be detected. The method, apparatus, and processing device provided by the embodiments of the invention can produce more accurate candidate segments; the candidate segments have good robustness and are suitable for various video motion detection models.

Description

Method and device for acquiring candidate segments from video and processing equipment
Technical Field
The present invention relates to the field of motion detection technologies, and in particular, to a method, an apparatus, and a processing device for acquiring a candidate segment from a video.
Background
Video action detection means detecting whether a specific target action exists in a target video and, if it does, determining the starting time and the ending time of the target action. With the explosive growth in the number of videos, video motion detection is applied in increasingly wide fields, including pedestrian surveillance, autonomous driving, short-video segmentation, and the like.
Video motion detection results are not ideal, because different actions differ greatly in duration and come in a wide variety. The existing mainstream video motion detection methods first produce segments that may contain motion and then train a classification network to classify those segments. This approach has the following problems: if the similarity between the background and the foreground of the video is high, the extracted features are not very discriminative, so the localization of action boundaries is inaccurate; and the generalization ability of the classification network is poor: it typically overfits one data set, classifies other data sets poorly, and its parameters need to be tuned again.
In view of the above problems of video motion detection in the prior art, no effective solution has been proposed at present.
Disclosure of Invention
In view of this, an object of the present invention is to provide a method, an apparatus and a processing device for obtaining candidate segments from a video, which can produce more accurate candidate segments, have good robustness, and are suitable for various video motion detection models.
In a first aspect, an embodiment of the present invention provides a method for acquiring a candidate segment from a video, including: acquiring a video to be detected; respectively calculating the image similarity between adjacent video frames of the video to be detected by a preset similarity algorithm to obtain a similarity sequence; wherein the ordering of image similarities in the sequence of similarities is the same as the ordering of the video frames; taking the image similarity which is greater than a first segmentation threshold value in the similarity sequence as a target image similarity; and if the arrangement sequence of the target image similarities in the similarity sequence is continuous, taking the video frames corresponding to the target image similarities as the candidate segments of the video to be detected.
Further, the step of using the video frames corresponding to the similarity of the plurality of target images as the candidate segments of the video to be detected includes: taking the first video frame corresponding to the similarity of the target images as a starting frame of the candidate segment, and taking the last video frame corresponding to the similarity of the target images as an ending frame of the candidate segment; and segmenting the segment between the starting frame and the ending frame from the video to be detected to obtain a candidate segment.
Further, the image similarities in the similarity sequence are provided with index identifications; if the arrangement order of the plurality of target image similarities in the similarity sequence is consecutive, the step of taking the video frames corresponding to the plurality of target image similarities as the candidate segments of the video to be detected comprises: judging whether the index identifications of adjacent target image similarities are consecutive; if so, judging whether the number of consecutive index identifications is greater than a preset quantity threshold;
and if it is greater than the preset quantity threshold, taking the video frames corresponding to the consecutive index identifications as a candidate segment of the video to be detected.
Further, after the candidate segment is obtained, the method further comprises: taking the image similarities in the similarity sequence corresponding to the candidate segment that are greater than a second segmentation threshold as subdivided image similarities, the second segmentation threshold being greater than the first segmentation threshold; if the arrangement order of the plurality of subdivided image similarities in the similarity sequence is consecutive, taking the video frames corresponding to the plurality of subdivided image similarities as first-type subdivided candidate segments of the candidate segment; and taking the other parts of the candidate segment, divided off by the subdivided candidate segments, as second-type subdivided candidate segments.
Further, the step of using the video frames corresponding to the similarity of the plurality of subdivided images as the candidate segments of the first type subdivision of the candidate segments includes: and taking the first video frame corresponding to the similarity of the plurality of subdivided images as a starting frame of the subdivided candidate segment, taking the last video frame corresponding to the similarity of the plurality of subdivided images as an ending frame of the subdivided candidate segment, and segmenting the candidate segment to obtain the subdivided candidate segment.
Further, after obtaining the subdivided candidate segments, the method further comprises: selecting one of said subdivided candidate segments among adjacent said candidate segments, respectively; and taking the first video frame of the previous subdivided candidate segment as the starting frame of the lengthened candidate segment, taking the last video frame of the subsequent subdivided candidate segment as the ending frame of the lengthened candidate segment, and segmenting the video to be detected to obtain the lengthened candidate segment.
Further, the method further comprises: setting a ranking loss function based on the degrees of overlap of two candidate segments with the correctly labeled segment, the two candidate segments having different degrees of overlap with the correctly labeled segment; and taking the ranking loss function as the loss function of a video motion detection model, and training the video motion detection model with the candidate segments.
Further, the method further comprises: and performing motion detection on the candidate segments through a pre-configured video motion detection model.
In a second aspect, an embodiment of the present invention provides an apparatus for acquiring a candidate segment from a video, including: the acquisition module is used for acquiring a video to be detected; the calculation module is used for respectively calculating the image similarity between the adjacent video frames of the video to be detected through a preset similarity calculation method to obtain a similarity sequence; wherein the ordering of image similarities in the sequence of similarities is the same as the ordering of the video frames; the searching module is used for taking the image similarity which is greater than a first segmentation threshold value in the similarity sequence as the target image similarity; and the segmentation module is used for taking the video frames corresponding to the similarity degrees of the target images as the candidate segments of the video to be detected if the arrangement sequence of the similarity degrees of the target images in the similarity sequence is continuous.
In a third aspect, an embodiment of the present invention provides a processing device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of the method according to any one of the first aspect when executing the computer program.
In a fourth aspect, an embodiment of the present invention provides a computer-readable medium having non-volatile program code executable by a processor, where the program code causes the processor to perform the steps of the method according to any one of the first aspect.
According to the method, the apparatus, and the processing device for acquiring candidate segments from a video provided by the embodiments of the invention, the image similarity between adjacent video frames of the video to be detected is calculated with a preset similarity algorithm to obtain a similarity sequence, in which the image similarities are ordered the same way as the video frames. The video frames corresponding to consecutive image similarities in the similarity sequence that are greater than a first segmentation threshold are then taken as candidate segments of the video to be detected.
Additional features and advantages of the disclosure will be set forth in the description which follows, or in part may be learned by the practice of the above-described techniques of the disclosure, or may be learned by practice of the disclosure.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a schematic structural diagram of a processing apparatus according to an embodiment of the present invention;
fig. 2 is a flowchart of a method for obtaining candidate segments from a video according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a process for training a model using a training Loss according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a process for generating candidate fragments using SSIM sequences according to an embodiment of the present invention;
fig. 5 is a verification result of a video motion detection model according to an embodiment of the present invention;
fig. 6 is a block diagram illustrating a structure of an apparatus for acquiring a candidate segment from a video according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the existing video motion detection method, the following problems exist in the process of producing segments possibly containing motions: 1. the positioning of the action boundaries of the segments is inaccurate; 2. the generalization ability is poor, and the fragments obtained by forced fitting cannot be applied to other data sets. Based on this, embodiments of the present invention provide a method, an apparatus, and a processing device for acquiring a candidate segment from a video, which are described in detail below by embodiments of the present invention.
The first embodiment is as follows:
first, a processing device 100 for implementing embodiments of the present invention, which may be used to execute methods of embodiments of the present invention, is described with reference to fig. 1.
As shown in FIG. 1, processing device 100 includes one or more processors 102, one or more memories 104, input devices 106, output devices 108, and a data collector 110, which are interconnected via a bus system 112 and/or other form of connection mechanism (not shown). It should be noted that the components and configuration of the processing device 100 shown in FIG. 1 are exemplary only, and not limiting, and that the processing device may have other components and configurations as desired.
The processor 102 may be implemented in at least one hardware form of a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), or an ASIC (Application-Specific Integrated Circuit). The processor 102 may be a Central Processing Unit (CPU) or another form of processing unit having data processing capability and/or instruction execution capability, and may control other components in the processing device 100 to perform desired functions.
The memory 104 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. On which one or more computer program instructions may be stored that may be executed by processor 102 to implement client-side functionality (implemented by the processor) and/or other desired functionality in embodiments of the invention described below. Various applications and various data, such as various data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
The input device 106 may be a device used by a user to input instructions and may include one or more of a keyboard, a mouse, a microphone, a touch screen, and the like.
The output device 108 may output various information (e.g., images or sounds) to the outside (e.g., a user), and may include one or more of a display, a speaker, and the like.
The data collector 110 is configured to collect data, where the data collected by the data collector is original data of a current target or target data, and then the data collector may further store the original data or the target data in the memory 104 for use by other components.
Exemplarily, a processing device for implementing the method for acquiring candidate segments from a video according to an embodiment of the present invention may be implemented as a smart terminal such as a server, a smart phone, a tablet computer, a computer, or the like.
Example two:
an embodiment of the present invention provides a method for acquiring a candidate segment from a video by using an image processing method, and referring to a flowchart of a method for acquiring a candidate segment from a video shown in fig. 2, the method may be executed by the processing device provided in the foregoing embodiment, and the method may include the following steps:
step S202, a video to be detected is obtained.
The method for acquiring candidate segments from a video according to this embodiment is to extract a plurality of candidate segments (proposals) from a video to be detected, and the video can be further detected based on the candidate segments.
And step S204, respectively calculating the image similarity between adjacent video frames of the video to be detected by a preset similarity algorithm to obtain a similarity sequence. Wherein the image similarity in the similarity sequence is ordered in the same way as the video frames.
The preset similarity algorithm is used to measure the similarity between two images; in this embodiment it measures the similarity between two adjacent images of the video, and whether the two adjacent images contain continuous motion can be determined from the image similarity, so that subsequent video segmentation is performed accordingly. The similarity algorithm may be implemented with, for example, mean squared error (MSE), structural similarity (SSIM), or peak signal-to-noise ratio (PSNR). After the image similarity between every two adjacent frames of the video to be detected has been calculated, arranging all the similarities in the order in which the corresponding images appear in the video yields the similarity sequence. The image similarities in the resulting similarity sequence are thus ordered the same way as the corresponding video frames.
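As a minimal sketch of this step (names are illustrative, not from the patent), the similarity sequence can be built by sliding over adjacent frame pairs; a simplified single-window SSIM serves as the metric here, but an MSE- or PSNR-based metric fits the same interface:

```python
import numpy as np

def ssim_global(a, b, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Simplified single-window SSIM over two whole grayscale frames."""
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

def similarity_sequence(frames, metric=ssim_global):
    """Similarity of each adjacent frame pair; the sequence keeps the
    frames' temporal order and has length len(frames) - 1."""
    return [metric(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]
```

Identical adjacent frames yield a similarity of exactly 1, while dissimilar frames score much lower, which is what the thresholding in the next steps relies on.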
And step S206, taking the image similarity which is greater than the first segmentation threshold value in the similarity sequence as the target image similarity.
The similarity greater than the first segmentation threshold indicates that two adjacent frames of images include continuous actions, and the similarity less than the first segmentation threshold indicates that two adjacent frames of images do not include continuous actions, so that the initial image and the ending image of a certain action in the video can be found according to the comparison result of the similarity and the first segmentation threshold.
Step S208, if the arrangement sequence of the target image similarities in the similarity sequence is continuous, the video frames corresponding to the target image similarities are used as candidate segments of the video to be detected.
Because the image similarity represents the image similarity between adjacent video frames, and the arrangement sequence of the target image similarities in the similarity sequence is continuous, it can be determined that the video frames corresponding to the target image similarities contain continuous actions, and the video frames corresponding to the target image similarities need to be segmented, so as to obtain the candidate segments of the video to be detected.
According to the method for acquiring candidate segments from a video provided by the embodiment of the invention, the image similarity between adjacent video frames of the video to be detected is calculated with a preset similarity algorithm to obtain a similarity sequence, in which the image similarities are ordered the same way as the video frames. The video frames corresponding to consecutive image similarities in the similarity sequence that are greater than a first segmentation threshold are then taken as candidate segments of the video to be detected.
After the similarity sequence is obtained, a number of consecutive image similarities can be selected from it and the corresponding video segment cut out, i.e., the candidate segment. Taking the video frames corresponding to the plurality of target image similarities as a candidate segment of the video to be detected may be implemented as follows: the first video frame corresponding to the plurality of target image similarities is taken as the start frame of the candidate segment, the last video frame corresponding to the plurality of target image similarities as the end frame of the candidate segment, and the segment from the start frame to the end frame is cut from the video to be detected to obtain the candidate segment. Since an image similarity is the similarity between two adjacent frame images, each image similarity corresponds to two images: the start frame is the earlier of the two images corresponding to the first target image similarity, and the end frame is the later of the two images corresponding to the last target image similarity.
To make it easier to screen the similarity sequence for consecutive image similarities, an index identification may be set for each image similarity in the similarity sequence, ordered the same way as the video frames; for example, the sequence number of the frame image may be used. When determining the candidate segments of the video to be detected, it is judged whether the index identifications of adjacent target image similarities are consecutive; with sequence numbers, for example, this means judging whether the difference between the sequence numbers of adjacent target image similarities is 1. If the index identifications are consecutive, it is further judged whether the number of consecutive index identifications is greater than a preset quantity threshold, in order to eliminate the adverse effect of overly short continuous segments on motion detection. If it is greater than the preset quantity threshold, the video frames corresponding to the consecutive index identifications are taken as a candidate segment of the video to be detected.
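This screening step can be sketched as follows (a hypothetical helper, not the patent's exact implementation): threshold the similarity sequence, group consecutive index identifications, and keep only runs longer than the quantity threshold. A run over similarities i..j corresponds to video frames i..j+1, since each similarity links frame i to frame i+1:

```python
def candidate_segments(sims, seg_threshold, min_run):
    """Return (start_frame, end_frame) pairs for each run of consecutive
    indices whose similarity exceeds seg_threshold, keeping only runs
    longer than min_run similarities."""
    segments, run = [], []
    for i, s in enumerate(sims):
        if s <= seg_threshold:
            continue
        if run and i != run[-1] + 1:  # index identifications not consecutive
            if len(run) > min_run:
                segments.append((run[0], run[-1] + 1))
            run = []
        run.append(i)
    if len(run) > min_run:
        segments.append((run[0], run[-1] + 1))
    return segments
```

For example, with similarities [0.9, 0.9, 0.9, 0.1, 0.9, 0.9, 0.9, 0.9], a threshold of 0.5, and a quantity threshold of 2, two candidate segments are produced, spanning frames 0..3 and 4..8.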
In order to obtain candidate segments with more accurate positioning boundaries, each of the obtained candidate segments can be further segmented to generate more detailed segmented segments. The above method may therefore further comprise:
(1) taking the image similarity which is greater than a second segmentation threshold in the similarity sequence corresponding to the candidate segments as the detail image similarity, wherein the second segmentation threshold is greater than the first segmentation threshold;
(2) and if the arrangement sequence of the multiple subdivided image similarities in the similarity sequence is continuous, taking the video frames corresponding to the multiple subdivided image similarities as the first type subdivided candidate segments of the candidate segments. Similar to the foregoing segmentation process, the first video frame corresponding to the multiple subdivided image similarities may be used as a starting frame of the subdivided candidate segment, the last video frame corresponding to the multiple subdivided image similarities may be used as an ending frame of the subdivided candidate segment, and the subdivided candidate segment may be obtained by segmenting the candidate segment. By increasing the segmentation threshold, finer candidate segments can be segmented from the original candidate segments, and finally, the precision of motion detection is improved.
(3) And taking other segments divided by the subdivided candidate segments in the candidate segments as the second type of subdivided candidate segments. In the step (2), the segmented part of the original candidate segment is used as a finer candidate segment, and the original candidate segment further includes at least one remaining other segment, which is also used as a finer candidate segment.
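Steps (1) to (3) above can be sketched as a single helper (an illustrative sketch; for simplicity it works in indices of the similarity sequence and omits the run-length filtering and frame conversion described earlier):

```python
def subdivide(start, end, sims, theta2):
    """Split candidate segment [start, end) with the stricter second
    segmentation threshold theta2 (> the first segmentation threshold).
    Returns (first-type subdivided segments, second-type remainders)."""
    first, run_start = [], None        # first-type: maximal runs above theta2
    for i in range(start, end):
        if sims[i] > theta2:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            first.append((run_start, i))
            run_start = None
    if run_start is not None:
        first.append((run_start, end))
    second, pos = [], start            # second-type: the remaining pieces
    for a, b in first:
        if a > pos:
            second.append((pos, a))
        pos = b
    if pos < end:
        second.append((pos, end))
    return first, second
```

Raising the threshold inside an existing candidate thus carves out finer segments, and the leftover pieces between them are kept as the second type of subdivided candidates.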
In order to generate candidate segments with different lengths, each of the obtained candidate segments may be further recombined to generate segmented segments with different lengths. The above method may further comprise:
(1) One subdivided candidate segment is selected from each of two adjacent candidate segments; the position of each subdivided candidate segment within its candidate segment is not limited.
(2) The first video frame of the earlier subdivided candidate segment is taken as the start frame of the lengthened candidate segment, the last video frame of the later subdivided candidate segment as the end frame of the lengthened candidate segment, and the corresponding segment is cut from the video to be detected to obtain the lengthened candidate segment. When the two subdivided candidate segments are joined in this way, the video frames between them are also included in the lengthened candidate segment. Because the subdivided candidate segments can sit at different positions within their candidate segments, candidate segments of various lengths are obtained, enriching the number of samples for training or detection.
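A minimal sketch of this recombination (assuming each subdivided piece is a (start, end) pair as in the earlier sketches): pairing every piece of the earlier candidate with every piece of the later one yields lengthened candidates of various lengths.

```python
def lengthened_candidates(pieces_prev, pieces_next):
    """One subdivided piece from each of two adjacent candidate segments,
    merged with everything in between into a longer candidate."""
    return [(start, end)
            for (start, _) in pieces_prev
            for (_, end) in pieces_next]
```

For example, pieces [(0, 2), (3, 5)] of one candidate combined with piece [(8, 10)] of the next candidate give the lengthened candidates (0, 10) and (3, 10).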
When the candidate segments have been obtained, a video motion detection model may be trained, or motion detection may be performed on the candidate segments with a pre-configured video motion detection model. During training, in order to improve the accuracy of the model, ordering information about the candidate segments within the video to be detected is incorporated, so that candidate segments with different degrees of overlap with the real action segment are distinguished. Based on this idea, the method further comprises the following steps:
(1) A ranking loss function is set based on the degrees of overlap of two candidate segments with the correctly labeled segment, the two candidate segments having different degrees of overlap with the correctly labeled segment. (2) The ranking loss function is taken as the loss function of the video motion detection model, and the video motion detection model is trained with the candidate segments.
Most existing methods train a deep learning model with a cross-entropy loss to obtain the video motion detection model and then classify the candidate segments, which ignores the relationship information among candidate segments: owing to the limited accuracy of the deep learning model, two candidate segments may both receive high scores. If ordering information about the candidate segments within the video is added during training, a good candidate segment can be made to score higher than a poor one, which greatly improves the accuracy of the model. A ranking loss function (Ranking Loss) can be added on top of the cross-entropy loss when training the model. Suppose the degrees of overlap of the two candidate segments with the correctly labeled action segment (ground truth) are c_p and c_q; without loss of generality, assume c_p > c_q. The ranking loss function can then be set during training as follows:
l_rank = max(0, c_q - c_p + ε)
See FIG. 3 for a schematic diagram of the process of training a model with the ranking loss, where ψ_1, ψ_2, and ψ_3 are three different candidate segments and C_1, C_2, and C_3 are the respective degrees of overlap of ψ_1, ψ_2, and ψ_3 with the correctly labeled segment, i.e., the scores corresponding to the candidate segments ψ_1, ψ_2, ψ_3 during model training. The goal of model training is that, when C_1, C_2, and C_3 are ranked pairwise, the corresponding order is C_1 > C_2 > C_3.
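The ranking loss above can be written directly (a sketch; the default margin ε = 0.1 is an illustrative assumption, not a value from the patent):

```python
def ranking_loss(c_p, c_q, eps=0.1):
    """l_rank = max(0, c_q - c_p + eps): zero once the better candidate's
    score c_p beats the worse one's score c_q by at least the margin eps,
    so the model is pushed to rank the better candidate higher."""
    return max(0.0, c_q - c_p + eps)
```

Added on top of the cross-entropy loss, this term only penalizes pairs that are ranked in the wrong order or separated by less than the margin.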
The following embodiment is described taking SSIM as the example for video segmentation. Adjacent pictures in a video are strongly correlated; the SSIM formula is as follows:
SSIM(x, y) = [(2 μ_x μ_y + C_1)(2 σ_xy + C_2)] / [(μ_x^2 + μ_y^2 + C_1)(σ_x^2 + σ_y^2 + C_2)]
where x and y represent the two images, μ_x and μ_y are their means, σ_x and σ_y their standard deviations, σ_xy is the covariance of the two pictures, and C_1 and C_2 are constants. SSIM compares the brightness, contrast, and structural similarity of two pictures. Using the SSIM similarity sequence, rich candidate segments can be generated with a segmentation strategy and a fusion strategy, as follows:
(1) Segmentation strategy: a binary vector is generated from the SSIM sequence S using the segmentation threshold θ. Similarities less than or equal to the segmentation threshold are set to 1 and similarities greater than the segmentation threshold to 0, where 1 marks the boundary of a candidate segment and 0 its interior.
b_i = 1, if s_i ≤ θ; b_i = 0, if s_i > θ
B(S, θ) = (b_1, b_2, …, b_T)
Collecting all indexes whose binary value is 1 gives B = {i | x_i ≠ 0}, where x_i is taken from B(S, θ).
(2) Fusion strategy: the indexes whose binary value is 1 are connected to obtain the candidate segments of the video:
Φ_ini = {(x_i, x_{i+1}) | x_{i+1} − x_i > δ, 1 ≤ i < T}
where x_i is taken from B, δ is the connectivity, and T is the length of B.
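The two strategies can be sketched as follows. The original fusion formula is not fully legible, so this sketch assumes one plausible reading: consecutive boundary indexes farther apart than the connectivity δ enclose a candidate segment.

```python
def boundary_indexes(sims, theta):
    """Segmentation strategy: indexes where similarity <= theta, i.e.
    where the binary vector B(S, theta) equals 1 (a segment boundary)."""
    return [i for i, s in enumerate(sims) if s <= theta]

def fuse(boundaries, delta):
    """Fusion strategy (assumed reading): consecutive boundary indexes
    more than delta apart enclose one candidate segment."""
    return [(boundaries[i], boundaries[i + 1])
            for i in range(len(boundaries) - 1)
            if boundaries[i + 1] - boundaries[i] > delta]
```

For example, the SSIM sequence [0.2, 0.9, 0.9, 0.9, 0.1, 0.9, 0.2] with θ = 0.5 gives the boundary set {0, 4, 6}, and fusing with δ = 1 yields the candidate segments (0, 4) and (4, 6).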
The initial candidate segments Φ_ini of the video are obtained with the segmentation and fusion strategies. To locate boundaries more precisely, the segmentation and fusion strategies are applied again to each segment in Φ_ini, yielding the finer candidate segments Φ_det. To produce candidate segments of different lengths, longer segments Φ_com can be generated based on all the boundary indexes in two adjacent candidate segments. Finally, all candidate segments are collected as the final candidate segments of the video, as follows:
Φ_V = Φ_ini ∪ Φ_det ∪ Φ_com
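Collecting the three candidate sets into the final set Φ_V is a plain set union; a minimal sketch, representing each segment as a (start, end) index pair:

```python
def collect_candidates(phi_ini, phi_det, phi_com):
    # Phi_V = Phi_ini | Phi_det | Phi_com, with duplicates removed.
    return sorted(set(phi_ini) | set(phi_det) | set(phi_com))
```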
See FIG. 4 for a schematic illustration of generating candidate segments from an SSIM sequence: initial candidate segments such as x_1^0–x_2^0 and x_3^0–x_4^0, subdivided candidate segments such as x_3^0–x_1^1 and x_1^1–x_4^0, and lengthened candidate segments such as x_2^0–x_3^0–x_1^1 and x_3^0–x_1^1–x_4^0.
FIG. 5 shows the validation results of a video motion detection model trained with the SSIM sequence and the aforementioned Ranking Loss; the results are far better than those of existing methods. In panels (a) and (b) of FIG. 5, the first row is the correctly labeled segment, the second row a better candidate segment, and the third row a poorer candidate segment. As the figure shows, the scores of the latter two differ greatly: the Ranking Loss successfully suppresses the poorer candidate segment.
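The Ranking Loss used here, L_rank = max(0, c_q − c_p + ε), can be sketched directly, where c_p is the overlap of the better candidate with the labeled segment and c_q that of the poorer one (the margin value is illustrative):

```python
def ranking_loss(c_p, c_q, eps=0.1):
    # Hinge-style loss: zero once the better candidate's overlap c_p
    # exceeds the poorer candidate's overlap c_q by at least eps,
    # positive otherwise, pushing the model to separate the two.
    return max(0.0, c_q - c_p + eps)
```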
Embodiment three:

For the method for acquiring candidate segments from a video provided in the second embodiment, an embodiment of the present invention provides an apparatus for acquiring candidate segments from a video. Referring to the structural block diagram of the apparatus shown in fig. 6, the apparatus includes:
an obtaining module 602, configured to obtain a video to be detected;
the calculating module 604 is configured to calculate image similarities between adjacent video frames of the video to be detected respectively through a preset similarity algorithm, so as to obtain a similarity sequence; the sequence of the image similarity in the similarity sequence is the same as the sequence of the video frames;
the searching module 606 is configured to use the image similarity greater than the first segmentation threshold in the similarity sequence as a target image similarity;
the segmentation module 608 is configured to, if the arrangement order of the multiple target image similarities in the similarity sequence is continuous, take the video frames corresponding to the multiple target image similarities as candidate segments of the video to be detected.
The apparatus for acquiring candidate segments from a video provided by this embodiment of the invention generates more accurate candidate segments from the image similarity between adjacent video frames together with the segmentation strategy; the resulting candidate segments are robust and suitable for various video motion detection models.
In one embodiment, the segmentation module is further configured to: taking a first video frame corresponding to the similarity of the target images as a starting frame of the candidate segment, and taking a last video frame corresponding to the similarity of the target images as an ending frame of the candidate segment; and segmenting a segment from the starting frame to the ending frame from the video to be detected to obtain a candidate segment.
In another embodiment, each image similarity in the similarity sequence is identified by an index; the segmentation module is further configured to: judge whether the index identifiers of adjacent image similarities are continuous; if so, judge whether the number of continuous index identifiers is greater than a preset number threshold; and if it is greater than the preset number threshold, take the video frames corresponding to the continuous index identifiers as candidate segments of the video to be detected.
In one embodiment, the apparatus further comprises a subdividing module configured to: take the image similarities greater than a second segmentation threshold in the similarity sequence corresponding to the candidate segment as subdivided image similarities, the second segmentation threshold being greater than the first segmentation threshold; if the arrangement order of multiple subdivided image similarities in the similarity sequence is continuous, take the video frames corresponding to the multiple subdivided image similarities as first-type subdivided candidate segments of the candidate segment; and take the other segments of the candidate segment, divided off by the subdivided candidate segments, as second-type subdivided candidate segments.
In another embodiment, the subdivision module is further configured to: and taking the first video frame corresponding to the similarity of the plurality of subdivided images as a starting frame of the subdivided candidate segments, taking the last video frame corresponding to the similarity of the plurality of subdivided images as an ending frame of the subdivided candidate segments, and dividing the candidate segments to obtain the subdivided candidate segments.
In one embodiment, the apparatus further comprises an extension module for: respectively selecting a subdivided candidate segment from adjacent candidate segments; and taking the first video frame of the previous subdivided candidate segment as the starting frame of the lengthened candidate segment, taking the last video frame of the subsequent subdivided candidate segment as the ending frame of the lengthened candidate segment, and segmenting the video to be detected to obtain the lengthened candidate segment.
In one embodiment, the apparatus further comprises a training module configured to: setting a sequencing loss function based on the overlapping degree of the two candidate segments and the correct marked segment; the overlapping degree of the two candidate segments and the correct marked segment is different; and taking the sequencing loss function as a loss function of the video motion detection model, and training the video motion detection model through the candidate segments.
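The overlap degree between a candidate segment and the correctly labeled segment is commonly computed as temporal intersection-over-union; a minimal sketch under that assumption (the patent does not prescribe this exact measure):

```python
def temporal_iou(seg, gt):
    # Intersection-over-union of two [start, end) temporal intervals.
    inter = max(0.0, min(seg[1], gt[1]) - max(seg[0], gt[0]))
    union = (seg[1] - seg[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0
```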
In one embodiment, the apparatus further comprises a detection module configured to: and performing motion detection on the candidate segments through a pre-configured video motion detection model.
The device provided by the embodiment has the same implementation principle and technical effect as the foregoing embodiment, and for the sake of brief description, reference may be made to the corresponding contents in the foregoing method embodiment for the portion of the embodiment of the device that is not mentioned.
Furthermore, the present embodiment provides a processing device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the computer program, the method for acquiring candidate segments from a video provided by the above embodiment is implemented.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the system described above may refer to the corresponding process in the foregoing embodiments, and is not described herein again.
Further, the present embodiment provides a computer-readable storage medium having stored thereon a computer program, which, when executed by a processor, performs the steps of the method provided by the above-described embodiment.
The method, the apparatus, and the computer program product of the processing device for acquiring a candidate segment from a video according to the embodiments of the present invention include a computer-readable storage medium storing a program code, where instructions included in the program code may be used to execute the method described in the foregoing method embodiments, and specific implementations may refer to the method embodiments and are not described herein again. The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present invention, which are used for illustrating the technical solutions of the present invention and not for limiting the same, and the protection scope of the present invention is not limited thereto, although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (11)

1. A method for obtaining candidate segments from a video, comprising:
acquiring a video to be detected;
respectively calculating the image similarity between adjacent video frames of the video to be detected by a preset similarity algorithm to obtain a similarity sequence; wherein the ordering of image similarities in the sequence of similarities is the same as the ordering of the video frames;
taking the image similarity which is greater than a first segmentation threshold value in the similarity sequence as a target image similarity;
if the arrangement sequence of the target image similarities in the similarity sequence is continuous, taking the video frames corresponding to the target image similarities as candidate segments of the video to be detected;
setting a sequencing loss function based on the overlapping degree of the two candidate segments and the correct marked segment; the overlapping degree of the two candidate segments and the correct labeling segment is different, and the sequencing loss function is used as a loss function of a video motion detection model;
wherein the ordering loss function comprises: L_rank = max(0, c_q − c_p + ε), where c_q represents the degree of overlap of one of the two candidate segments with the correctly labeled segment, and c_p represents the degree of overlap of the other of the two candidate segments with the correctly labeled segment.
2. The method according to claim 1, wherein the step of using the video frames corresponding to the similarity of the plurality of target images as the candidate segments of the video to be detected comprises:
taking the first video frame corresponding to the similarity of the target images as a starting frame of the candidate segment, and taking the last video frame corresponding to the similarity of the target images as an ending frame of the candidate segment;
and segmenting the segment between the starting frame and the ending frame from the video to be detected to obtain a candidate segment.
3. The method according to claim 1, wherein the image similarity in the similarity sequence is identified with an index;
if the arrangement sequence of the target image similarities in the similarity sequence is continuous, the step of taking the video frames corresponding to the target image similarities as the candidate segments of the video to be detected comprises the following steps:
judging whether the index marks adjacent to the image similarity are continuous or not;
if so, judging whether the continuous index identification is larger than a preset quantity threshold value;
and if the number of the video frames is larger than the preset number threshold, taking the video frames corresponding to the continuous index marks as candidate segments of the video to be detected.
4. The method of claim 1, wherein after obtaining the candidate segment, the method further comprises:
taking the image similarities which are greater than a second segmentation threshold in the similarity sequence corresponding to the candidate segment as subdivided image similarities; the second segmentation threshold is greater than the first segmentation threshold;

if the arrangement order of the multiple subdivided image similarities in the similarity sequence is continuous, taking the video frames corresponding to the multiple subdivided image similarities as first-type subdivided candidate segments of the candidate segment;

and taking the other segments of the candidate segment divided off by the subdivided candidate segments as second-type subdivided candidate segments.
5. The method according to claim 4, wherein the step of using the video frames corresponding to the plurality of subdivided image similarities as the candidate segments of the first type subdivision of the candidate segments comprises:
and taking the first video frame corresponding to the similarity of the plurality of subdivided images as a starting frame of the subdivided candidate segment, taking the last video frame corresponding to the similarity of the plurality of subdivided images as an ending frame of the subdivided candidate segment, and segmenting the candidate segment to obtain the subdivided candidate segment.
6. The method of claim 4 or 5, wherein after obtaining the subdivided candidate segments, the method further comprises:
selecting one of said subdivided candidate segments among adjacent said candidate segments, respectively;
and taking the first video frame of the previous subdivided candidate segment as the starting frame of the lengthened candidate segment, taking the last video frame of the subsequent subdivided candidate segment as the ending frame of the lengthened candidate segment, and segmenting the video to be detected to obtain the lengthened candidate segment.
7. The method of claim 1, further comprising:
and taking the sequencing loss function as a loss function of the video motion detection model, and training the video motion detection model through the candidate segments.
8. The method of claim 1 or 7, further comprising:
and performing motion detection on the candidate segments through a pre-configured video motion detection model.
9. An apparatus for obtaining candidate segments from a video, comprising:
the acquisition module is used for acquiring a video to be detected;
the calculation module is used for respectively calculating the image similarity between the adjacent video frames of the video to be detected through a preset similarity calculation method to obtain a similarity sequence; wherein the ordering of image similarities in the sequence of similarities is the same as the ordering of the video frames;
the searching module is used for taking the image similarity which is greater than a first segmentation threshold value in the similarity sequence as the target image similarity;
the segmentation module is used for taking the video frames corresponding to the similarity degrees of the target images as candidate segments of the video to be detected if the arrangement sequence of the similarity degrees of the target images in the similarity sequence is continuous;
the apparatus is further configured to: setting a sequencing loss function based on the overlapping degree of the two candidate segments and the correct marked segment; the overlapping degree of the two candidate segments and the correct labeling segment is different, and the sequencing loss function is used as a loss function of a video motion detection model;
wherein the ordering loss function comprises: L_rank = max(0, c_q − c_p + ε), where c_q represents the degree of overlap of one of the two candidate segments with the correctly labeled segment, and c_p represents the degree of overlap of the other of the two candidate segments with the correctly labeled segment.
10. A processing device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method of any of the preceding claims 1 to 8 when executing the computer program.
11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of the preceding claims 1 to 8.
CN201910231596.9A 2019-03-25 2019-03-25 Method and device for acquiring candidate segments from video and processing equipment Active CN109977262B (en)