CN113365104B - Video concentration method and device

Video concentration method and device

Info

Publication number
CN113365104B
Authority
CN
China
Prior art keywords
video, frame, determining, target, scenes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110625553.6A
Other languages
Chinese (zh)
Other versions
CN113365104A (en)
Inventor
李睿之
吴昀蓁
郑邦东
李虎
吴松霖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp
Priority to CN202110625553.6A
Publication of CN113365104A
Application granted
Publication of CN113365104B
Legal status: Active
Anticipated expiration

Classifications

    • H04N 21/23418 - Processing of video elementary streams, involving operations for analysing video streams, e.g. detecting features or characteristics
    • H04N 21/23412 - Processing of video elementary streams, for generating or manipulating the scene composition of objects, e.g. MPEG-4 objects
    • H04N 21/23424 - Processing of video elementary streams, involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement
    • G06T 5/50 - Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06T 7/194 - Segmentation; Edge detection, involving foreground-background segmentation

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The application discloses a video concentration method and device, relating to the field of computer technology and in particular to artificial intelligence. In one embodiment, the method comprises: acquiring a video to be condensed and determining a target video frame in the video to be condensed; determining the pixel area of the foreground mask corresponding to the frame following the target video frame, and dividing the video to be condensed into video scenes based on that pixel area and a preset threshold; and allocating tracks to the video scenes in time order, then fusing the video scenes on the allocated tracks to generate the condensed video. By judging whether the foreground pixel area of the foreground mask is smaller than the preset threshold, the video is divided into scenes; tracks are allocated to the divided scenes and the scenes are fused on those tracks, which improves both the efficiency and the quality of video concentration.

Description

Video concentration method and device
Technical Field
The application relates to the field of computer technology, in particular to artificial intelligence, and more particularly to a video concentration method and device.
Background
Currently, video concentration typically separates the foreground from the background using a foreground-background separation method, classifies the foreground as stationary or moving by determining whether it is static, and then deletes the stationary portions of the video to achieve the concentration effect. This approach results in low concentration efficiency and poor concentration quality.
In the process of implementing the present application, the inventor finds that at least the following problems exist in the prior art:
the video concentration efficiency is low and the quality of the condensed video is poor.
Disclosure of Invention
In view of this, embodiments of the present application provide a video concentration method and apparatus, which can solve the prior-art problems of low video concentration efficiency and low concentration quality.
To achieve the above object, according to one aspect of the embodiments of the present application, a video concentration method is provided, including:
acquiring a video to be condensed, and determining a target video frame in the video to be condensed;
determining the pixel area of the foreground mask corresponding to the frame following the target video frame, and dividing the video to be condensed into video scenes based on the pixel area and a preset threshold;
and allocating tracks to the video scenes in time order, and fusing the video scenes based on the allocated tracks to generate the condensed video.
Optionally, determining a target video frame in the video to be condensed includes:
the following iterative steps are performed a plurality of times:
reading a video frame from a video to be condensed based on the time sequence;
extracting a foreground mask of a video frame, and further determining the pixel area of the foreground mask to be used as a second pixel area;
in response to the fact that the area of the second pixel point is smaller than the preset threshold value, determining the read video frame as a target video frame;
in response to determining that a video frame is not the last video frame of the video to be condensed, updating a video frame read from the video to be condensed based on the time sequence.
Optionally, the method for dividing the video scene to be concentrated into the video scenes by using the pixel point area of the foreground mask corresponding to the next frame of the target video frame as the first pixel point area includes:
and in response to determining that the area of the first pixel point is smaller than the preset threshold, deleting the next frame of the target video frame, and intercepting all continuous and promising video frames before the target video frame into a video scene.
Optionally, after deleting a frame subsequent to the target video frame, the method further includes:
and updating the next frame of the deleted video frame to be the next frame of the target video frame.
Optionally, allocating tracks to the video scenes in time order comprises:
sorting all video scenes in time order, and determining the duration of each video scene;
determining the occupied time of each track as the sum of the durations of the video scenes inserted into that track;
determining the track with the least occupied time as a target track;
and taking out the video scenes in order, inserting each into the target track, and updating the occupied time of each track.
Optionally, fusing the video scenes based on the allocated tracks to generate the condensed video includes:
aligning the start times of the first video scenes in each track, and then longitudinally splicing and fusing the video scenes in each track to generate the condensed video.
Optionally, before determining the target video frame in the video to be condensed, the method further includes:
reading a preset number of video frames from the video to be condensed in time order;
performing background initialization on the preset number of video frames;
and performing foreground detection and extraction on the preset number of video frames to obtain an initialized foreground mask.
Optionally, determining a target video frame in the video to be condensed includes:
excluding the preset number of video frames from the time-ordered video frames of the video to be condensed, and determining the target video frame from the remaining video frames.
In addition, the present application also provides a video concentration apparatus, comprising:
an acquisition unit configured to acquire a video to be condensed and determine a target video frame in the video to be condensed;
a video scene dividing unit configured to determine the pixel area of the foreground mask corresponding to the frame following the target video frame, and divide the video to be condensed into video scenes based on the pixel area and a preset threshold;
and a generating unit configured to allocate tracks to the video scenes in time order, and fuse the video scenes based on the allocated tracks to generate the condensed video.
Optionally, the obtaining unit is further configured to:
the following iterative steps are performed a plurality of times:
reading a video frame from a video to be condensed based on the time sequence;
extracting a foreground mask of a video frame, and further determining the pixel area of the foreground mask to be used as a second pixel area;
in response to the fact that the area of the second pixel point is smaller than the preset threshold value, determining the read video frame as a target video frame;
in response to determining that a video frame is not the last video frame of the video to be condensed, updating a video frame read from the video to be condensed based on the time sequence.
Optionally, taking a pixel area of a foreground mask corresponding to a subsequent frame of the target video frame as a first pixel area, the video scene dividing unit is further configured to:
and in response to determining that the first pixel area is smaller than the preset threshold, delete the frame following the target video frame, and cut all consecutive video frames containing foreground that precede the target video frame into one video scene.
Optionally, the video concentration apparatus further comprises an update unit configured to:
after deleting the frame following the target video frame, update the frame following the deleted frame to be the new frame following the target video frame.
Optionally, the generating unit is further configured to:
sort all video scenes in time order, and determine the duration of each video scene;
determine the occupied time of each track as the sum of the durations of the video scenes inserted into that track;
determine the track with the least occupied time as a target track;
and take out the video scenes in order, insert each into the target track, and update the occupied time of each track.
Optionally, the generating unit is further configured to:
and align the start times of the first video scenes in each track, and then longitudinally splice and fuse the video scenes in each track to generate the condensed video.
Optionally, the video concentration apparatus further comprises a preprocessing unit configured to:
read a preset number of video frames from the video to be condensed in time order;
perform background initialization on the preset number of video frames;
and perform foreground detection and extraction on the preset number of video frames to obtain an initialized foreground mask.
Optionally, the obtaining unit is further configured to:
exclude the preset number of video frames from the time-ordered video frames of the video to be condensed, and determine the target video frame from the remaining video frames.
In addition, the present application also provides an electronic device for video concentration, comprising: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the video concentration method described above.
In addition, the present application also provides a computer readable medium on which a computer program is stored; when executed by a processor, the program implements the video concentration method described above.
One embodiment of the above invention has the following advantages or beneficial effects: a video to be condensed is acquired and a target video frame in it is determined; the pixel area of the foreground mask corresponding to the frame following the target video frame is determined, and the video to be condensed is divided into video scenes based on that pixel area and a preset threshold; tracks are allocated to the video scenes in time order, and the video scenes are fused based on the allocated tracks to generate the condensed video. By judging whether the foreground pixel area of the foreground mask is smaller than the preset threshold, the video is divided into scenes, tracks are allocated to the divided scenes, and the scenes are fused on those tracks, which improves both the efficiency and the quality of video concentration.
Further effects of the above optional implementations will be described below in connection with specific embodiments.
Drawings
The drawings are included to provide a further understanding of the application and are not to be construed as limiting the application. Wherein:
fig. 1 is a schematic diagram of the main flow of a video concentration method according to a first embodiment of the present application;
fig. 2 is a schematic diagram of the main flow of a video concentration method according to a second embodiment of the present application;
fig. 3 is a schematic view of an application scenario of a video concentration method according to a third embodiment of the present application;
FIG. 4 is a schematic diagram of the main blocks of a video concentration apparatus according to an embodiment of the present application;
FIG. 5 is an exemplary system architecture diagram to which embodiments of the present application may be applied;
fig. 6 is a schematic structural diagram of a computer system suitable for implementing a terminal device or a server according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application are described below with reference to the accompanying drawings; various details of the embodiments are included to aid understanding and should be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the present application. Descriptions of well-known functions and constructions are omitted below for clarity and conciseness.
Fig. 1 is a schematic diagram of the main flow of a video concentration method according to a first embodiment of the present application. As shown in fig. 1, the video concentration method includes:
step S101, obtaining a video to be concentrated, and determining a target video frame in the video to be concentrated.
In this embodiment, the execution subject of the video concentration method (for example, a server) may obtain the video to be condensed over a wired or wireless connection. The video to be condensed may be, for example, bank surveillance footage or film footage; the type and specific content of the video to be condensed are not limited in this application. After obtaining the video to be condensed, the execution subject may iterate frame by frame with a foreground-background separation algorithm to determine the target video frames in the video to be condensed, where a target video frame is the first no-foreground frame after a run of consecutive foreground frames. Foreground: a class of objects that move in a video or carry special meaning in an image (e.g., people, animals, vehicles, and other movable objects). Background: generally, the entire environmental space of a video or picture with the foreground removed. Foreground-background separation (matting): separating a part of interest from an original image or video so that it can be combined with another image or video to form a new one. Common foreground-background separation algorithms include the visual background extraction algorithm (ViBe, in RGB or grayscale form), Bayesian history, Codebook, the extended Gaussian mixture model (EGMM), the Gaussian mixture model (GMM), the single Gaussian model, first-order low-pass filtering, the K-nearest neighbors (KNN) algorithm, and the like.
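As an illustration only, and not the claimed implementation, per-frame foreground masks could be obtained from a Gaussian mixture model roughly as follows, assuming Python with OpenCV; the file name and the MOG2 parameters are placeholders.

```python
# Illustrative sketch: per-frame foreground masks from a Gaussian mixture model.
import cv2

cap = cv2.VideoCapture("to_be_condensed.mp4")   # hypothetical input path
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                                detectShadows=False)
masks = []
ok, frame = cap.read()
while ok:
    masks.append(subtractor.apply(frame))   # binary foreground mask for this frame
    ok, frame = cap.read()
cap.release()
```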
In some optional implementations of this embodiment, before determining the target video frame in the video to be condensed, the video condensing method further includes:
the method includes reading a preset number of video frames from a video to be concentrated based on a time sequence, specifically, taking the first 30 frames of the video to be concentrated (the taken frame number to be preprocessed is not limited in the present application) for preprocessing, so as to initialize background and foreground masks. Foreground mask (mask): also called mask, is a binary template picture used to mark the foreground, usually consisting of black and white.
Specifically, background initialization is performed on a preset number of video frames, and the execution main body can call the gaussian mixture model GMM to perform foreground and background separation calculation and iterate frame by frame to update and maintain the background, thereby implementing background initialization.
The execution main body can call GMM to perform foreground detection and extraction on a preset number of video frames to obtain an initialized foreground mask.
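A minimal sketch of this preprocessing stage, again assuming Python with OpenCV; the frame count of 30 follows the example above, while the function name and return values are assumptions.

```python
import cv2

PREPROCESS_FRAMES = 30   # number of frames used for initialization, per the example above

def initialize_background(video_path, n=PREPROCESS_FRAMES):
    """Feed the first n frames to a GMM background subtractor so that a background
    model and an initial foreground mask exist before condensation starts."""
    cap = cv2.VideoCapture(video_path)
    subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=False)
    init_mask = None
    for _ in range(n):
        ok, frame = cap.read()
        if not ok:
            break
        init_mask = subtractor.apply(frame)   # each call updates the background model
    return cap, subtractor, init_mask
```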
In this implementation, determining a target video frame in a video to be condensed includes:
the method includes the steps of removing a preset number of video frames from video frames arranged based on a time sequence in a concentrated video, and further determining a target video frame in a video to be concentrated from the remaining video frames.
Specifically, the execution subject may take the first 30 frames from the beginning in the video to be condensed to initialize the background and foreground masks, and then determine the first no-foreground frame after the continuous foreground frame from the video frames after the 31 st frame and determine the first no-foreground frame as the target video frame. It is understood that there may be a plurality of target video frames in the video to be condensed, and the execution subject may determine, as the target video frames, all of the first non-foreground frames after consecutive foreground frames in the video to be condensed.
In some optional implementations of this embodiment, determining a target video frame in the video to be condensed includes:
the following iterative steps are performed a plurality of times:
reading a video frame from the video to be condensed in time order; extracting the foreground mask of the video frame and determining its pixel area as a second pixel area; in response to determining that the second pixel area is smaller than the preset threshold, determining the read video frame as a target video frame; and in response to determining that the video frame is not the last video frame of the video to be condensed, updating the video frame read from the video to be condensed in time order.
Specifically, consecutive foreground frames are determined from the frames of the video to be condensed based on the time order and the pixel area of each frame's foreground mask (that is, a frame whose foreground-mask pixel area is greater than a preset threshold is a foreground frame, and otherwise it is a no-foreground frame; the preset threshold may be the product of the total pixel area of one frame of the video to be condensed and a preset value). This applies to the case where consecutive groups of foreground frames are separated by only one no-foreground frame. For example, the frames of the video to be condensed, sorted in time order, may be: foreground frame 1, foreground frame 2, foreground frame 3, a no-foreground frame, foreground frame 4, foreground frame 5, foreground frame 6, foreground frame 7, a no-foreground frame. Foreground frames 1, 2 and 3 form one group of consecutive foreground frames, i.e. video scene 1, and foreground frames 4, 5, 6 and 7 form another group, i.e. video scene 2. There is only one no-foreground frame between video scene 1 and video scene 2.
As another implementation, the executing entity may further compare the foreground-mask pixel area of the frame following the first no-foreground frame with the preset threshold; if it is also smaller than the threshold, two consecutive frames are no-foreground frames, which indicates that the first no-foreground frame is not merely noise, and that frame is determined to be a target video frame. It can be understood that a no-foreground frame is not necessarily a frame with no foreground at all, but a frame whose foreground pixel area is smaller than the preset threshold. This applies to the case where consecutive groups of foreground frames are separated by several no-foreground frames. For example, the frames of the video to be condensed, sorted in time order, may be: foreground frame a, foreground frame b, foreground frame c, no-foreground frames, foreground frame d, foreground frame e, foreground frame f, foreground frame g, no-foreground frames. Foreground frames a, b and c form one group of consecutive foreground frames, i.e. video scene A, and foreground frames d, e, f and g form another group, i.e. video scene B. There may be multiple no-foreground frames between video scene A and video scene B.
Thus, the execution subject may determine, in time order, the target video frame following each run of consecutive foreground frames in the video to be condensed. The frame preceding a target video frame is a foreground frame; the frame following it may be either a foreground frame or a no-foreground frame. Video scene: within the same time period of a video, the continuous, uninterrupted overlapping appearance of two or more foregrounds, together with the logical extension of each foreground, is referred to as a video scene. Different video scenes can be distinguished by the foreground pixel area in the foreground mask (scene boundaries occur where the foreground is interrupted). That is, a video scene may be a group of one or more temporally consecutive foreground frames.
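The target-frame logic can be sketched as follows; the ratio used for the preset threshold is an arbitrary placeholder, and `masks` is assumed to hold the per-frame foreground masks produced earlier.

```python
import cv2

RATIO = 0.01   # assumed preset value: fraction of the frame's total pixel area

def is_foreground_frame(fg_mask, ratio=RATIO):
    """A frame counts as a foreground frame when the foreground pixel area of its
    mask exceeds the preset threshold (total pixel area of the frame * preset value)."""
    threshold = fg_mask.shape[0] * fg_mask.shape[1] * ratio
    return cv2.countNonZero(fg_mask) > threshold

def find_target_frames(masks):
    """Return indices of target video frames: the first no-foreground frame that
    follows a run of consecutive foreground frames."""
    targets, prev_is_fg = [], False
    for i, mask in enumerate(masks):
        cur_is_fg = is_foreground_frame(mask)
        if prev_is_fg and not cur_is_fg:
            targets.append(i)
        prev_is_fg = cur_is_fg
    return targets
```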
Step S102, determining the pixel area of the foreground mask corresponding to the frame following the target video frame, and dividing the video to be condensed into video scenes based on the pixel area and a preset threshold.
In some optional implementations of this embodiment, taking the pixel area of the foreground mask corresponding to the frame following the target video frame as a first pixel area, dividing the video to be condensed into video scenes includes:
in response to determining that the first pixel area is smaller than the preset threshold, deleting the frame following the target video frame, and cutting all consecutive video frames containing foreground that precede the target video frame into one video scene.
Specifically, when the execution subject determines that the foreground-mask pixel area of the frame following the target video frame is smaller than the preset threshold, both the target video frame and the frame following it are no-foreground frames. The execution subject may then delete the frame following the target video frame, i.e. the adjacent, consecutive no-foreground frames behind it, and use the target video frame as a video scene boundary, thereby cutting all consecutive video frames containing foreground that precede the target video frame into one video scene (apart from the target video frame, the no-foreground frames are deleted, but the corresponding data of the deleted frames is stored in a database). The video scene boundary may be a point in time or a time period, specifically the start time of the target video frame, or the period from the start time to the end time of the target video frame; this is not specifically limited in this application.
After the frame following the target video frame is deleted, the frame after the deleted frame becomes the new frame following the target video frame. The foreground-mask pixel area of this updated following frame is then compared with the preset threshold; if it is still smaller than the threshold, the execution subject deletes it as well, again updates the following frame, and repeats the comparison until a frame appears whose foreground-mask pixel area is greater than the preset threshold. That frame is taken as the starting frame of the next video scene. In other words, the execution subject deletes all adjacent, consecutive no-foreground frames after the target video frame until the first foreground frame after the target video frame is reached, and takes that first foreground frame as the starting frame of the next video scene. Each target video frame serves as a video scene boundary between video scenes, and a video scene consists of the consecutive foreground frames between two different scene boundaries.
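A simplified sketch of this scene-cutting step, assuming Python with OpenCV; it drops no-foreground frames and closes a scene at each boundary. The threshold ratio is a placeholder, and frames and masks are assumed to be paired per-frame lists.

```python
import cv2

RATIO = 0.01   # assumed preset value (fraction of the frame's total pixel area)

def split_into_scenes(frames, masks, ratio=RATIO):
    """Cut the video into scenes: a scene is a maximal run of consecutive frames whose
    foreground-mask pixel area exceeds the preset threshold; no-foreground frames act
    as scene boundaries and are dropped (their data could still be archived separately,
    as the description notes)."""
    scenes, current = [], []
    for frame, mask in zip(frames, masks):
        threshold = mask.shape[0] * mask.shape[1] * ratio
        if cv2.countNonZero(mask) > threshold:
            current.append(frame)          # foreground frame: extend the current scene
        elif current:                      # boundary reached: close the scene
            scenes.append(current)
            current = []
    if current:
        scenes.append(current)
    return scenes
```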
Step S103, allocating tracks to the video scenes in time order, and fusing the video scenes based on the allocated tracks to generate the condensed video.
After dividing the video to be condensed into video scenes, the execution subject may allocate tracks to the video scenes in time order, and then merge the video scenes based on the allocated tracks to generate the condensed video.
Specifically, the execution subject may determine the current occupied time of the available tracks, select the track with the least occupied time for inserting the next video scene, and in this way allocate a track to each video scene. The video scenes are then longitudinally fused along the tracks to generate the condensed video.
In summary, a video to be condensed is acquired and a target video frame in it is determined; the pixel area of the foreground mask corresponding to the frame following the target video frame is determined, and the video to be condensed is divided into video scenes based on that pixel area and a preset threshold; tracks are allocated to the video scenes in time order, and the video scenes are fused based on the allocated tracks to generate the condensed video. By judging whether the foreground pixel area of the foreground mask is smaller than the preset threshold, the video is divided into scenes, tracks are allocated to the divided scenes, and the scenes are fused on those tracks, which improves both the efficiency and the quality of video concentration.
Fig. 2 is a schematic diagram of the main flow of a video concentration method according to a second embodiment of the present application. As shown in fig. 2, the video concentration method includes:
step S201, acquiring a video to be concentrated, and determining a target video frame in the video to be concentrated.
Step S202, determining the pixel point area of a foreground mask corresponding to a frame next to a target video frame, and dividing a video scene of a video to be concentrated based on the pixel point area and a preset threshold value.
Step S203, allocating tracks to each video scene based on the time sequence, and then fusing each video scene based on the allocated tracks to generate a condensed video.
The principle of step S201 to step S203 is similar to that of step S101 to step S103, and is not described here again.
Specifically, step S203 can also be realized by step S2031 to step S2035:
step S2031, sequence the video scenes according to the time sequence, and determine the time of each video scene.
For example, sorting the video scenes of the video to be condensed according to a time sequence may specifically be: video scene a, dropped no-foreground frame, video scene B, dropped no-foreground frame, video scene C, dropped no-foreground frame, video scene D. The time of each video scene is determined, specifically, the time for detecting the video scene a is 5 minutes, the time for detecting the video scene B is 3 minutes, the time for detecting the video scene C is 6 minutes, and the time for detecting the video scene D is 4 minutes. The boundary between the video scene a and the discarded no-foreground frame is a target video frame, and may specifically be a start time of the target video frame.
Step S2032, determine the occupied time of each track as the sum of the durations of the video scenes inserted into it.
For example, suppose there are two empty tracks a and b. When a track is selected for the first time, the execution subject may insert the first video scene in time order into either track; for example, video scene A is inserted into track a, after which the occupied time of track a is 5 minutes and that of track b is 0 minutes.
Step S2033, determine the track with the least occupied time as the target track.
Continuing the example above: after video scene A is inserted into track a, the occupied time of track a is 5 minutes and that of track b is 0 minutes, so track b has the least occupied time and is determined to be the target track; the next video scene in time order should therefore be inserted into track b. The target track is updated in real time after each insertion.
Step S2034, take out the video scenes in order, insert each into the current target track, and update the occupied time of each track.
In this embodiment, the video scenes are taken out in time order and inserted one by one into the target track, which is updated in real time; after each insertion the occupied time of every track is updated and the target track is re-determined. Continuing the example, the scenes in time order are video scene A (5 minutes), video scene B (3 minutes), video scene C (6 minutes) and video scene D (4 minutes). Video scene A is inserted into track a, so track a is occupied for 5 minutes and track b for 0 minutes; track b becomes the target track and video scene B is inserted into it, giving track a 5 minutes and track b 3 minutes. Track b is still the target track, so video scene C is inserted into track b, whose occupied time becomes 9 minutes while track a remains at 5 minutes; track a is now the target track, so video scene D is inserted into track a, bringing its occupied time to 9 minutes as well. If a further scene were to be inserted, either track could be chosen since their occupied times are now equal, and subsequent scenes would continue to be inserted into whichever track has the least occupied time until the last scene in time order has been inserted. In this way every video scene is placed on an available track, and the subsequent longitudinal fusion of the scenes across tracks can be performed. In this embodiment, the number of available tracks may be n, where n is a positive integer. A sketch of this greedy allocation is given after this paragraph.
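The greedy rule of steps S2031 to S2034 can be illustrated as follows in Python; the function name and the two-track default are assumptions, while the example durations are those from the description.

```python
def allocate_tracks(scene_durations, n_tracks=2):
    """Greedy allocation: scenes are taken in time order and each one is inserted
    into the track whose accumulated occupied time is currently the smallest."""
    occupation = [0.0] * n_tracks                   # occupied time per track
    assignment = [[] for _ in range(n_tracks)]      # scene indices per track
    for idx, duration in enumerate(scene_durations):
        target = occupation.index(min(occupation))  # least-occupied track is the target
        assignment[target].append(idx)
        occupation[target] += duration
    return assignment, occupation

# The example from the description: scenes A..D lasting 5, 3, 6 and 4 minutes.
tracks, times = allocate_tracks([5, 3, 6, 4])
print(tracks)   # [[0, 3], [1, 2]]  -> track a holds A and D, track b holds B and C
print(times)    # [9, 9]
```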
Step S2035, align the start times of the first video scenes in each track, and then longitudinally splice and fuse the video scenes in each track to generate the condensed video.
Specifically, the longitudinal splicing and fusion can be illustrated as follows:
The video scenes inserted into track a are, in order: video scene A, video scene D.
The video scenes inserted into track b are, in order: video scene B, video scene C.
The execution subject may align the start times of video scene A and video scene B (i.e. set them to be the same) and update the start times of video scene D and video scene C accordingly. The execution subject may then perform longitudinal Poisson fusion between the tracks: since video scene A lasts 5 minutes, video scene B 3 minutes, video scene C 6 minutes and video scene D 4 minutes, after scenes A and B are fused 2 minutes of scene A remain; these remaining 2 minutes are fused with scene C, after which 4 minutes of scene C remain; those remaining 4 minutes are then fused with scene D, and the condensed video is generated.
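A simplified stand-in for this longitudinal fusion, assuming two tracks of frames whose start times are already aligned; plain weighted blending is used here in place of the Poisson fusion described above, and the function name is an assumption.

```python
import cv2
from itertools import zip_longest

def fuse_tracks(track_a_frames, track_b_frames):
    """Longitudinally merge two tracks whose start times are already aligned.
    Overlapping frames are blended with a simple weighted average as a stand-in
    for Poisson fusion; frames left over on the longer track pass through unchanged."""
    condensed = []
    for fa, fb in zip_longest(track_a_frames, track_b_frames):
        if fa is None:
            condensed.append(fb)
        elif fb is None:
            condensed.append(fa)
        else:
            condensed.append(cv2.addWeighted(fa, 0.5, fb, 0.5, 0))
    return condensed
```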
This embodiment overcomes the low concentration ratio of traditional foreground-background-based video concentration. Scene cutting discards segments without foreground, and multi-track longitudinal splicing and fusion compresses scenes from different time spans of the original video into the same time span, making the foreground in the fused condensed video more compact, saving storage space, improving the efficiency of analysing massive surveillance video, and restoring the characteristics of the original foreground with high quality. In addition, the number of fusion tracks can be increased as the situation requires.
Fig. 3 is a schematic view of an application scenario of a video concentration method according to a third embodiment of the present application. The video concentration method is applied to scenarios where video is condensed based on a foreground-background separation method. Static video, i.e. video with a fixed background, is typically produced by a camera shooting from a fixed point; by contrast, dynamic video has a changing background and is usually captured by a rotating fixed camera or a movable camera. As shown in fig. 3, a server (not shown in the figure) may first obtain a segment of video to be condensed; in this embodiment the background of the video is fixed, i.e. it is a static video, and the video is subjected to foreground-background separation preprocessing. Specifically, the first 30 frames of the video may be acquired (the number of frames is not specifically limited in this application) and preprocessed by iterating frame by frame with the Gaussian mixture model (GMM) to obtain an initialized background and an initialized foreground mask, providing a basis for the subsequent concentration operation. The server may then take the 31st frame of the video (i.e. the first frame after the preprocessed frames; the exact frame is not specifically limited in this application), perform foreground-background separation on it to extract its foreground mask, and denoise and optimize the extracted mask. Specifically, the server may dilate and then erode the foreground mask using morphological processing to remove small noise, and then apply median filtering as a second pass, yielding a more complete, nearly noise-free and continuous foreground mask. After denoising, the pixel area X of the foreground mask is calculated and compared with a preset threshold Y, where Y may be the product of the total pixel area of the acquired frame and a preset value. The preset value may be a preset percentage of the frame's total pixel area that the foreground mask occupies; the percentage can be determined according to the actual situation and is not specifically limited in this application. When the server determines that X is below the threshold Y, it determines the foreground-mask pixel area Z of the frame following the current frame and checks whether Z is also below Y. If so, the server cuts all the consecutive foreground frames before the current frame (specifically, all consecutive foreground frames that precede the current frame, excluding it, and follow the most recently discarded no-foreground frame, excluding that frame) into one video scene.
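The mask denoising described above (dilation, then erosion, then median filtering) could look roughly like this with OpenCV; the kernel sizes are assumptions, not values from the application.

```python
import cv2
import numpy as np

def denoise_mask(fg_mask, kernel_size=3, median_ksize=5):
    """Clean a raw foreground mask: dilate then erode to remove small noise,
    then apply median filtering for a more complete, continuous mask."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    cleaned = cv2.dilate(fg_mask, kernel, iterations=1)
    cleaned = cv2.erode(cleaned, kernel, iterations=1)
    return cv2.medianBlur(cleaned, median_ksize)
```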
For example, some already-divided scenes arranged in time order may be: scene A, a discarded no-foreground frame, scene B, a discarded no-foreground frame, foreground frame 1, foreground frame 2, foreground frame 3, ..., foreground frame n, the current frame, and the frame following the current frame. When the foreground-mask pixel areas of both the current frame and the frame following it are below the preset threshold Y, and the frame preceding the current frame is a foreground frame (i.e. the current frame is the target video frame, the first no-foreground frame after consecutive foreground frames), the server may divide all foreground frames between the most recently discarded no-foreground frame and the current frame into one video scene, i.e. divide foreground frames 1, 2, 3, ..., n into one video scene. The server then determines whether all frames of the video to be condensed have been iterated over. If so, it allocates tracks to the divided scenes (the number of tracks may be one, two or more, which is not limited in this application), performs track-based longitudinal splicing and fusion of the scenes on each track to generate the fused condensed video, and ends the concentration process. If the foreground-mask pixel area X of the current frame is not less than the preset threshold Y, or the foreground-mask pixel area Z of the following frame is not less than Y (which indicates the current frame may be noise and can be deleted), or the iteration over the frames of the video has not finished, the method returns to acquire the next frame in time order and repeats the foreground-background separation calculation until all frames of the video to be condensed have been processed.
In this embodiment, existing foreground-background separation techniques are combined with the video concentration task. The foreground and the foreground mask are obtained with a foreground-background separation algorithm, the Gaussian mixture model (Mixture of Gaussians), so that no foreground is missed; a threshold on the foreground pixel area of the foreground mask is used to divide the video into several consecutive scenes; and finally the divided scenes are arranged in a certain temporal-spatial order and fused over two tracks to form the condensed video. Through video concentration, segments without valuable content (a moving foreground) are culled. Through video segmentation and merging, all moving targets can be viewed within a few seconds, the original video can be traced back, and the position of a target in the original video can be located instantly. This greatly improves the efficiency of analysing massive surveillance video and restores the characteristics of the original foreground with high quality.
Fig. 4 is a schematic diagram of the main blocks of a video concentration apparatus according to an embodiment of the present application. As shown in fig. 4, the video concentration apparatus includes an acquisition unit 401, a video scene division unit 402, and a generation unit 403.
The obtaining unit 401 is configured to obtain a video to be condensed, and determine a target video frame in the video to be condensed.
The video scene dividing unit 402 is configured to determine a pixel point area of a foreground mask corresponding to a frame subsequent to the target video frame, and further divide a video scene of the video to be concentrated based on the pixel point area and a preset threshold.
A generating unit 403 configured to allocate tracks to each video scene based on the time series, and further merge each video scene based on the allocated tracks to generate a condensed video.
In some embodiments, the obtaining unit 401 is further configured to perform the following iterative steps multiple times: reading a video frame from the video to be condensed in time order; extracting the foreground mask of the video frame and determining its pixel area as a second pixel area; in response to determining that the second pixel area is smaller than the preset threshold, determining the read video frame as a target video frame; and in response to determining that the video frame is not the last video frame of the video to be condensed, updating the video frame read from the video to be condensed in time order.
In some embodiments, taking the pixel area of the foreground mask corresponding to the frame following the target video frame as a first pixel area, the video scene dividing unit 402 is further configured to: in response to determining that the first pixel area is smaller than the preset threshold, delete the frame following the target video frame and cut all consecutive video frames containing foreground that precede the target video frame into one video scene.
In some embodiments, the video concentration apparatus further comprises an update unit, not shown in fig. 4, configured to: after deleting the frame following the target video frame, update the frame following the deleted frame to be the new frame following the target video frame.
In some embodiments, the generating unit 403 is further configured to: sort all video scenes in time order and determine the duration of each video scene; determine the occupied time of each track as the sum of the durations of the video scenes inserted into it; determine the track with the least occupied time as the target track; and take out the video scenes in order, insert each into the target track, and update the occupied time of each track.
In some embodiments, the generating unit 403 is further configured to: align the start times of the first video scenes in each track, and then longitudinally splice and fuse the video scenes in each track to generate the condensed video.
In some embodiments, the video concentration apparatus further comprises a preprocessing unit, not shown in fig. 4, configured to: read a preset number of video frames from the video to be condensed in time order; perform background initialization on the preset number of video frames; and perform foreground detection and extraction on the preset number of video frames to obtain an initialized foreground mask.
In some embodiments, the obtaining unit 401 is further configured to: exclude the preset number of video frames from the time-ordered frames of the video to be condensed, and determine the target video frame from the remaining frames.
It should be noted that the video concentration method and the video concentration apparatus of the present application correspond in their implementation details, so repeated descriptions are omitted.
Fig. 5 illustrates an exemplary system architecture 500 to which the video concentration method or video concentration apparatus of embodiments of the present application may be applied.
As shown in fig. 5, the system architecture 500 may include terminal devices 501, 502, 503, a network 504, and a server 505. The network 504 serves to provide a medium for communication links between the terminal devices 501, 502, 503 and the server 505. Network 504 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 501, 502, 503 to interact with a server 505 over a network 504 to receive or send messages or the like. The terminal devices 501, 502, 503 may have installed thereon various communication client applications, such as shopping-like applications, web browser applications, search-like applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).
The terminal devices 501, 502, 503 may be various electronic devices having video processing screens and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 505 may be a server providing various services, for example a background management server (merely an example) that supports video concentration requests submitted by users via the terminal devices 501, 502, 503. The background management server may obtain a video to be condensed and determine a target video frame in it; determine the pixel area of the foreground mask corresponding to the frame following the target video frame, and divide the video to be condensed into video scenes based on that pixel area and a preset threshold; and allocate tracks to the video scenes in time order, then fuse the video scenes based on the allocated tracks to generate the condensed video, thereby improving the efficiency and quality of video concentration.
It should be noted that the video concentration method provided by the embodiments of the present application is generally executed by the server 505, and accordingly, the video concentration apparatus is generally disposed in the server 505.
It should be understood that the number of terminal devices, networks, and servers in fig. 5 are merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 6, shown is a block diagram of a computer system 600 suitable for use in implementing a terminal device of an embodiment of the present application. The terminal device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the use range of the embodiment of the present application.
As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU)601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM603, various programs and data necessary for the operation of the computer system 600 are also stored. The CPU601, ROM602, and RAM603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input section 606 including a keyboard, a mouse, and the like; an output section 607 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a speaker, and the like; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card or a modem. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as necessary, so that a computer program read therefrom can be installed into the storage section 608 as needed.
In particular, according to embodiments disclosed herein, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, embodiments disclosed herein include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609 and/or installed from the removable medium 611. The above-described functions defined in the system of the present application are executed when the computer program is executed by the Central Processing Unit (CPU) 601.
It should be noted that the computer readable medium shown in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or by hardware. The described units may also be provided in a processor, which may be described as: a processor comprising an acquisition unit, a video scene division unit, and a generation unit. The names of these units do not, in some cases, constitute a limitation on the units themselves.
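For orientation only, the three units named above might be wired together as in the following sketch; the class name and attribute names are assumptions that mirror the description and are not part of the claimed apparatus.

```python
# Purely illustrative wiring of the three units; names are assumptions.
class VideoCondensationDevice:
    def __init__(self, acquisition_unit, scene_division_unit, generation_unit):
        self.acquisition_unit = acquisition_unit        # obtains the video and target frames
        self.scene_division_unit = scene_division_unit  # splits the video into scenes
        self.generation_unit = generation_unit          # allocates tracks and fuses scenes

    def condense(self, video_path):
        video, target_frames = self.acquisition_unit(video_path)
        scenes = self.scene_division_unit(video, target_frames)
        return self.generation_unit(scenes)
```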
As another aspect, the present application also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: obtain a video to be condensed and determine a target video frame in the video to be condensed; determine the pixel point area of a foreground mask corresponding to the frame next to the target video frame, and divide the video to be condensed into video scenes based on the pixel point area and a preset threshold; and allocate tracks to the video scenes based on the time sequence, and fuse the video scenes based on the allocated tracks to generate a condensed video, thereby improving both the efficiency and the quality of video concentration.
According to the technical scheme of the embodiments of the application, a video to be concentrated is acquired and a target video frame in the video to be concentrated is determined; the pixel point area of the foreground mask corresponding to the frame next to the target video frame is determined, and the video to be concentrated is divided into video scenes based on the pixel point area and a preset threshold; tracks are then allocated to the video scenes based on the time sequence, and the video scenes are fused based on the allocated tracks to generate the concentrated video. In this way, the video scenes are divided by judging whether the pixel point area of the foreground in the foreground mask is smaller than the preset threshold, tracks are allocated to the divided video scenes, and the video scenes are fused based on the allocated tracks, which improves both the efficiency and the quality of video concentration.
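By way of illustration only, the scene-division step described above can be sketched as follows in Python. The sketch assumes OpenCV's MOG2 background subtractor as a stand-in for the unspecified foreground-extraction model and a hand-picked area threshold; the function name, threshold value, and structure are assumptions, not taken from the disclosure.

```python
import cv2

def split_into_scenes(video_path, area_threshold=500):
    """Close the current scene whenever a frame's foreground mask contains
    fewer than `area_threshold` non-zero pixels (an 'empty' frame)."""
    cap = cv2.VideoCapture(video_path)
    subtractor = cv2.createBackgroundSubtractorMOG2()
    scenes, current_scene = [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)
        area = cv2.countNonZero(mask)      # foreground pixel point area
        if area < area_threshold:
            # Almost no foreground: discard this frame and cut a scene from
            # the consecutive foreground frames gathered so far.
            if current_scene:
                scenes.append(current_scene)
                current_scene = []
        else:
            current_scene.append(frame)
    if current_scene:
        scenes.append(current_scene)
    cap.release()
    return scenes
```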
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may occur depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (11)

1. A video concentration method, comprising:
acquiring a video to be concentrated, and determining a target video frame in the video to be concentrated;
determining the pixel point area of a foreground mask corresponding to a frame next to the target video frame, and dividing the video to be concentrated into video scenes based on the pixel point area and a preset threshold;
distributing tracks to the video scenes based on the time sequence, and fusing the video scenes based on the distributed tracks to generate a concentrated video;
wherein the assigning tracks to the video scenes based on the time sequence comprises:
sorting all the video scenes in temporal order, and determining the duration of each video scene;
determining the occupied time of a track according to the sum of the durations of the video scenes inserted into the track;
determining the track with the least occupied time as a target track;
and sequentially taking out the video scenes according to the sequence of the video scenes, inserting the video scenes into the target track, and updating the occupied time of each track.
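Read literally, the allocation recited above is a greedy balancing step: scenes are taken in temporal order and each one is inserted into the track with the least occupied time, whose occupancy is then updated. The following is a minimal sketch under that reading, where each scene is represented only by its duration in seconds; the function and variable names are illustrative.

```python
def allocate_tracks(scene_durations, num_tracks=3):
    """Greedily assign scenes (in temporal order) to the least-occupied
    track; return the list of scene indices placed on each track."""
    tracks = [[] for _ in range(num_tracks)]
    occupied = [0.0] * num_tracks                 # occupied time per track
    for idx, duration in enumerate(scene_durations):
        target = occupied.index(min(occupied))    # track with least occupied time
        tracks[target].append(idx)                # insert the scene into the target track
        occupied[target] += duration              # update the occupied time
    return tracks

# For example, allocate_tracks([12.0, 7.5, 9.0, 4.0], num_tracks=2)
# returns [[0, 3], [1, 2]].
```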
2. The method according to claim 1, wherein the determining the target video frame in the video to be condensed comprises:
the following iterative steps are performed a plurality of times:
reading a video frame from the video to be condensed based on the time sequence;
extracting a foreground mask of the video frame, and further determining the pixel point area of the foreground mask as a second pixel point area;
determining the read video frame as a target video frame in response to determining that the second pixel point area is smaller than a preset threshold;
and in response to determining that the read video frame is not the last video frame of the video to be condensed, updating the video frame read from the video to be condensed based on the time sequence.
3. The method according to claim 2, wherein the pixel point area of the foreground mask corresponding to the frame subsequent to the target video frame is taken as a first pixel point area, and the dividing of the video to be concentrated into video scenes comprises:
and in response to determining that the first pixel point area is smaller than a preset threshold, deleting the frame subsequent to the target video frame, and intercepting all consecutive video frames containing foreground before the target video frame into one video scene.
4. The method of claim 3, wherein after said deleting a frame subsequent to the target video frame, the method further comprises:
and updating the next frame of the deleted video frame to be the next frame of the target video frame.
5. The method of claim 1, wherein fusing the video scenes based on the assigned tracks to generate a condensed video comprises:
and aligning the starting time of the initial video scene in each track, and further performing longitudinal splicing and fusion on the video scenes in each track to generate a concentrated video.
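One plausible reading of the longitudinal splicing and fusion recited above is that the scenes inside each track are concatenated in time, every track starts at the same instant, and frames that coincide in time across tracks are blended into a single output frame. The sketch below follows that reading with equal-weight blending and assumes frames of identical size; the names are illustrative.

```python
import numpy as np

def fuse_tracks(tracks):
    """Concatenate the scenes inside each track, align every track at t = 0,
    and blend co-temporal frames with equal weights into one output frame."""
    # Longitudinal splicing: within a track, scenes follow one another in time.
    timelines = [[f for scene in track for f in scene] for track in tracks]
    length = max(len(t) for t in timelines)
    condensed = []
    for t in range(length):
        active = [tl[t] for tl in timelines if t < len(tl)]
        fused = np.mean(np.stack(active).astype(np.float32), axis=0)
        condensed.append(fused.astype(np.uint8))
    return condensed
```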
6. The method of claim 1, wherein prior to said determining a target video frame in said video to be condensed, said method further comprises:
reading a preset number of video frames from the video to be concentrated based on the time sequence;
performing background initialization on the preset number of video frames;
and carrying out foreground detection and extraction on the preset number of video frames to obtain an initialized foreground mask.
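As one possible realization of this initialization, the preset number of frames can simply warm up a background-subtraction model before any target frame is sought. A minimal sketch assuming OpenCV's MOG2 subtractor follows; the warm-up count and names are illustrative.

```python
import cv2

def initialize_background(video_path, warmup_frames=100):
    """Feed the first `warmup_frames` frames to the background model so that
    subsequent foreground masks are stable; return the model and last mask."""
    cap = cv2.VideoCapture(video_path)
    subtractor = cv2.createBackgroundSubtractorMOG2(history=warmup_frames)
    mask = None
    for _ in range(warmup_frames):
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)   # updates the background model
    cap.release()
    return subtractor, mask
```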
7. The method according to claim 6, wherein the determining the target video frame in the video to be condensed comprises:
and removing the preset number of video frames from the video frames, arranged based on the time sequence, of the video to be concentrated, and further determining a target video frame in the video to be concentrated from the remaining video frames.
8. A video concentration apparatus, comprising:
an acquisition unit configured to acquire a video to be concentrated and determine a target video frame in the video to be concentrated;
a video scene division unit configured to determine the pixel point area of a foreground mask corresponding to a frame next to the target video frame, and to divide the video to be concentrated into video scenes based on the pixel point area and a preset threshold;
a generation unit configured to allocate tracks to the video scenes based on the time sequence, and to fuse the video scenes based on the allocated tracks to generate a condensed video; wherein the allocating tracks to the video scenes based on the time sequence comprises: sorting all the video scenes in temporal order, and determining the duration of each video scene; determining the occupied time of a track according to the sum of the durations of the video scenes inserted into the track; determining the track with the least occupied time as a target track; and sequentially taking out the video scenes according to their order, inserting them into the target track, and updating the occupied time of each track.
9. The apparatus of claim 8, wherein the obtaining unit is further configured to:
the following iterative steps are performed a plurality of times:
reading a video frame from the video to be condensed based on the time sequence;
extracting a foreground mask of the video frame, and further determining the pixel point area of the foreground mask to be used as a second pixel point area;
determining the read video frame as a target video frame in response to determining that the second pixel point area is smaller than a preset threshold;
and in response to determining that the read video frame is not the last video frame of the video to be condensed, updating the video frame read from the video to be condensed based on the time sequence.
10. An electronic device for video concentration, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-7.
11. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-7.
CN202110625553.6A 2021-06-04 2021-06-04 Video concentration method and device Active CN113365104B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110625553.6A CN113365104B (en) 2021-06-04 2021-06-04 Video concentration method and device

Publications (2)

Publication Number Publication Date
CN113365104A (en) 2021-09-07
CN113365104B (en) 2022-09-09

Family

ID=77532263

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110625553.6A Active CN113365104B (en) 2021-06-04 2021-06-04 Video concentration method and device

Country Status (1)

Country Link
CN (1) CN113365104B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116074642B (en) * 2023-03-28 2023-06-06 石家庄铁道大学 Monitoring video concentration method based on multi-target processing unit

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009044351A1 (en) * 2007-10-04 2009-04-09 Koninklijke Philips Electronics N.V. Generation of image data summarizing a sequence of video frames
CN102256065A (en) * 2011-07-25 2011-11-23 中国科学院自动化研究所 Automatic video condensing method based on video monitoring network
CN104683765A (en) * 2015-02-04 2015-06-03 上海依图网络科技有限公司 Video concentration method based on mobile object detection
CN106354816A (en) * 2016-08-30 2017-01-25 东软集团股份有限公司 Video image processing method and video image processing device
CN106686452A (en) * 2016-12-29 2017-05-17 北京奇艺世纪科技有限公司 Dynamic picture generation method and device
CN109587581A (en) * 2017-09-29 2019-04-05 阿里巴巴集团控股有限公司 Video breviary generation method and video breviary generating means
CN112883783A (en) * 2021-01-12 2021-06-01 普联国际有限公司 Video concentration method and device, terminal equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9866887B2 (en) * 2016-03-08 2018-01-09 Flipboard, Inc. Auto video preview within a digital magazine

Also Published As

Publication number Publication date
CN113365104A (en) 2021-09-07

Similar Documents

Publication Publication Date Title
CN108353208B (en) Optimizing media fingerprint retention to improve system resource utilization
CN113382279B (en) Live broadcast recommendation method, device, equipment, storage medium and computer program product
CN113691733B (en) Video jitter detection method and device, electronic equipment and storage medium
US20220172476A1 (en) Video similarity detection method, apparatus, and device
CN110852961A (en) Real-time video denoising method and system based on convolutional neural network
CN112818955B (en) Image segmentation method, device, computer equipment and storage medium
CN110852980A (en) Interactive image filling method and system, server, device and medium
CN113365104B (en) Video concentration method and device
CN113688839B (en) Video processing method and device, electronic equipment and computer readable storage medium
KR20010074972A (en) Automatic extraction method of the structure of a video sequence
CN113297416A (en) Video data storage method and device, electronic equipment and readable storage medium
CN113822110A (en) Target detection method and device
CN112560772A (en) Face recognition method, device, equipment and storage medium
US11928855B2 (en) Method, device, and computer program product for video processing
CN110852250A (en) Vehicle weight removing method and device based on maximum area method and storage medium
CN112487943B (en) Key frame de-duplication method and device and electronic equipment
CN113240720B (en) Three-dimensional surface reconstruction method and device, server and readable storage medium
CN114238223A (en) Picture removing method and device, computer equipment and computer readable storage medium
CN113033552A (en) Text recognition method and device and electronic equipment
CN112581492A (en) Moving target detection method and device
CN113705548B (en) Topic type identification method and device
CN112752098B (en) Video editing effect verification method and device
CN112749660B (en) Method and device for generating video content description information
CN112749660A (en) Method and equipment for generating video content description information
CN115965647A (en) Background image generation method, image fusion method, device, electronic equipment and readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant