CN115037992A - Video processing method, device and storage medium - Google Patents

Video processing method, device and storage medium

Info

Publication number
CN115037992A
Authority
CN
China
Prior art keywords
target
video frame
video
moving
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210645927.5A
Other languages
Chinese (zh)
Inventor
赵伟
韩铮
熊伟龄
关键
慕永晖
王峰
秦萠
林辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Media Group
Original Assignee
China Media Group
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Media Group
Priority to CN202210645927.5A
Publication of CN115037992A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81 Monomedia components thereof
    • H04N21/816 Monomedia components thereof involving special video data, e.g. 3D video
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21 Server components or server architectures
    • H04N21/218 Source of audio or video content, e.g. local disk arrays
    • H04N21/2187 Live feed
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47 End-user applications
    • H04N21/488 Data services, e.g. news ticker
    • H04N21/4886 Data services, e.g. news ticker for displaying a ticker, e.g. scrolling banner for news, stock exchange, weather data

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present disclosure provide a video processing method, a video processing device and a storage medium. The method comprises the following steps: detecting the positions of a moving target in a current video frame and a first video frame, the first video frame comprising a video frame preceding the current video frame in a target video clip, and the target video clip being a video clip of a live video that contains the moving target; predicting the position of the moving target in a second video frame according to its positions in the current video frame and the first video frame to obtain a position prediction result, the second video frame comprising a video frame subsequent to the current video frame in the target video clip; cropping a target area image containing the moving target out of the second video frame according to the position prediction result; and separating a target image of the moving target from the target area image. The target image is used for generating a time slice special effect, which presents the motion states of the moving target at different points in time on the same frame picture.

Description

Video processing method, device and storage medium
Technical Field
The present disclosure relates to the field of computer vision technologies, and in particular, to a video processing method, apparatus, and storage medium.
Background
With the continuous development of internet technology, various sports events are broadcast live, allowing audiences to watch them in real time.
For example, during the live broadcast of sports events such as ski competitions, motion special effects can be produced for the video shot by a camera at a fixed position, either purely by hand or with image-assisted tools, so that audiences can understand the athletes' technique and skills.
However, such video processing relies heavily on manual work, which not only incurs unnecessary labor costs but also takes a long time, making it difficult to meet the timeliness requirements of live sports broadcasting.
Disclosure of Invention
The embodiment of the disclosure provides a video processing method, a video processing device and a storage medium.
According to a first aspect of the embodiments of the present disclosure, there is provided a video processing method, the method including:
detecting the positions of the moving target in the current video frame and the first video frame; wherein the first video frame comprises: a video frame preceding the current video frame in the target video segment; the target video clip is a video clip containing the moving target in a live video;
predicting the position of the moving target in a second video frame according to the positions of the moving target in the current video frame and the first video frame to obtain a position prediction result; wherein the second video frame comprises: a video frame subsequent to the current video frame in the target video segment;
cutting out a target area image containing the moving target from the second video frame according to the position prediction result;
separating a target image of the moving target from the target area image; the target image of the moving target is used for generating a time slice special effect, wherein the time slice special effect is used for presenting the moving states of the moving target at different time points on the same frame of picture.
According to a second aspect of embodiments of the present disclosure, there is provided a video processing apparatus, the apparatus comprising:
the detection module is used for detecting the positions of the moving target in the current video frame and the first video frame; wherein the first video frame comprises: a video frame preceding the current video frame in the target video segment; the target video clip is a video clip containing the moving target in a live video;
the prediction module is used for predicting the position of the moving target in a second video frame according to the positions of the moving target in the current video frame and the first video frame to obtain a position prediction result; wherein the second video frame comprises: a video frame subsequent to the current video frame in the target video segment;
the cutting module is used for cutting out a target area image containing the moving target from the second video frame according to the position prediction result;
the separation module is used for separating a target image of the moving target from the target area image; the target image of the moving target is used for generating a time slice special effect, wherein the time slice special effect is used for presenting the moving states of the moving target at different time points on the same frame of picture.
According to a third aspect of embodiments of the present disclosure, there is provided a computer device including:
a processor and a memory for storing executable instructions operable on the processor, wherein:
the processor is configured to execute the executable instructions, and the executable instructions perform the steps of the video processing method according to the first aspect.
According to a fourth aspect of embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium having stored therein computer-executable instructions that, when executed by a processor, implement the steps in the video processing method according to the first aspect.
According to the video processing method, device and storage medium of the embodiments of the present disclosure, the position of the moving target in a second video frame following the current video frame is predicted from its positions in the current video frame and in a first video frame preceding it, and a target area image containing the moving target is then cropped from the second video frame according to the position prediction result. Because only a local area of the second video frame, determined by the position prediction result, needs to be searched during cropping, the computational overhead of target area image detection is reduced and the processing time of the system algorithm is shortened. Furthermore, the target image of the moving target separated from the target area image is used to generate the time slice special effect, so that a lossless time slice special effect video can be generated quickly, meeting the timeliness requirements of live sports broadcasting while improving the viewing experience of its audience.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a schematic flowchart illustrating a video processing method according to an embodiment of the present disclosure;
fig. 2 is a schematic flow chart illustrating a video processing method according to an embodiment of the present disclosure;
FIG. 3 shows a schematic flow diagram of step S10 of the method of FIG. 2;
FIG. 4 shows a schematic flow chart of step S12 in the method of FIG. 1;
FIG. 5 shows a schematic flow chart of step S13 in the method of FIG. 1;
FIG. 6 shows a schematic flow chart of step S14 in the method of FIG. 1;
fig. 7a is a schematic diagram illustrating a hardware architecture of a video processing system according to an embodiment of the present disclosure;
fig. 7b is a schematic diagram illustrating a software architecture of a video processing system according to an embodiment of the present disclosure;
fig. 7c is a schematic diagram of a video frame image for triggering detection provided by the embodiment of the present disclosure;
FIG. 7d is a schematic diagram of a pre-processed video frame image provided by an embodiment of the present disclosure;
FIG. 7e is a schematic diagram illustrating a cropped image provided by an embodiment of the present disclosure;
FIG. 7f is a schematic diagram illustrating a frame-by-frame fused image provided by an embodiment of the disclosure;
FIG. 7g is an image diagram illustrating a data and cache structure provided by an embodiment of the disclosure;
fig. 8 shows a schematic structural diagram of a video processing apparatus provided in an embodiment of the present disclosure;
fig. 9 shows a block diagram of a computer device according to an embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with embodiments of the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the disclosed embodiments, as detailed in the appended claims.
The terminology used in the embodiments of the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the embodiments of the present disclosure. As used in the disclosed embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information in the embodiments of the present disclosure, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of embodiments of the present disclosure. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to a determination", depending on the context.
The video processing method provided by the embodiments of the present disclosure is suitable for producing sports video special effects in live scenes, where the live scene may be a live sports event, such as aerial freestyle skiing, skating, diving or gymnastics. The video processing method may be performed by a video processing apparatus, which may be implemented by means of software and/or hardware, and may generally be integrated in a computer device, such as a terminal device or a server.
Fig. 1 shows a flowchart of a video processing method provided by an embodiment of the present disclosure. Referring to fig. 1, the video processing method may include the steps of:
s11, detecting the positions of the moving object in the current video frame and the first video frame; wherein the first video frame comprises: a video frame preceding the current video frame in the target video segment; the target video clip is a video clip containing the moving target in a live video;
the live video may be a video of the entire athletic track for the sporting event that is captured by the image capture device. Each video frame of the live video includes a picture of the entire sports track. Wherein different video frames have different timestamps, where the timestamp of one video frame can be used to determine the capture moment of the video frame.
The video frames in the live video may be stored in RGB format in a FIFO (First In, First Out) buffer created in advance in memory, where the buffer size may be predetermined or determined dynamically according to the available memory.
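By way of illustration only, a minimal Python sketch of such a FIFO frame cache is given below; the class name, default capacity and timestamp handling are assumptions made for this sketch, not details taken from the disclosure.

```python
from collections import deque
import numpy as np

class FrameBuffer:
    """Minimal FIFO frame cache sketch; the capacity value is illustrative."""
    def __init__(self, capacity: int = 300):
        self.frames = deque(maxlen=capacity)  # oldest entries are evicted first

    def push(self, frame_rgb: np.ndarray, timestamp: float) -> None:
        # Keep the RGB pixels together with the capture timestamp.
        self.frames.append((timestamp, frame_rgb))

    def since(self, t0: float):
        # Return all buffered frames captured at or after t0.
        return [(t, f) for t, f in self.frames if t >= t0]
```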
The moving target may be a target object in motion on the track or field covered by the image acquisition range. For example, the moving target may include a target athlete competing in a sports event and the athlete's sports equipment.
In addition, the moving target may be determined by the video processing apparatus according to a received user selection operation.
Wherein the current video frame may be a video frame currently to be processed among a plurality of video frames of the target video clip.
The detection timing of the first video frame in the target video segment is before the current video frame. The first video frame may be, for example, a previous video frame adjacent to the current video frame or another video frame prior to the current video frame.
In some embodiments, when the first video frame (or the current video frame) is the initial video frame of the target video clip, the actual position of the moving target in that frame may be determined using a preset detection algorithm. The preset detection algorithm may be a deep-learning-based target detection method or the like, which is not specifically limited in this embodiment.
In some embodiments, when the first video frame or the current video frame is the initial video frame of the target video clip, a region range in which the moving target is located may be determined in that frame, and the position of the moving target detected within this region range.
Considering that in practical application scenarios the region in which the athlete appears in the initial video frame is relatively fixed, the position of the moving target can be detected directly within this region rather than over the whole picture of the frame, which improves video processing efficiency.
In other embodiments, when the first video frame (or the current video frame) is a video frame after the initial video frame of the target video clip, a position prediction result for the moving target in that frame may be obtained based on the actual position of the moving target in an earlier video frame; the corresponding local area of the moving target in that frame is then determined according to the position prediction result, and the actual position of the moving target within this local area is determined using a preset detection algorithm.
S12, predicting the position of the moving object in the second video frame according to the positions of the moving object in the current video frame and the first video frame to obtain a position prediction result; wherein the second video frame comprises: a video frame subsequent to the current video frame in the target video segment.
The processing timing of the second video frame in the target video segment is subsequent to the current video frame. The second video frame may be, for example, a subsequent video frame adjacent to the current video frame or another video frame subsequent to the current video frame.
In a specific implementation, a position time series of the moving target may be generated from its positions in the current video frame and the first video frame, combined with the timestamps of those frames. The position time series is input into a preset position prediction model, which predicts the position of the moving target in the second video frame and yields the corresponding position prediction result.
The position prediction model may be a pre-trained temporal neural network model, such as a multilayer RNN (Recurrent Neural Network) model or a multilayer LSTM (Long Short-Term Memory) model.
In addition, the position prediction model may also be a mathematical model determined by the live scene of the moving target; for example, when the live scene is aerial freestyle skiing, the position prediction model may be a projectile motion model.
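As an illustration of such a mathematical model, the sketch below fits a projectile-style trajectory in screen space (linear in x, parabolic in y) to the observed positions and extrapolates it to a future timestamp; the function name and fitting choices are assumptions made for this sketch.

```python
import numpy as np

def predict_position(times, xs, ys, t_future):
    """Fit x(t) linearly and y(t) quadratically (a screen-space parabola)
    to the observed track, then extrapolate to t_future. A sketch only:
    the projection of a real parabola onto the image is approximate."""
    times, xs, ys = map(np.asarray, (times, xs, ys))
    fx = np.polyfit(times, xs, 1)  # x(t) ~ v_x * t + x0
    fy = np.polyfit(times, ys, 2)  # y(t) ~ a*t^2 + b*t + c
    return float(np.polyval(fx, t_future)), float(np.polyval(fy, t_future))
```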
It is understood that, when the second video frame includes a plurality of video frames following the current video frame, the position of the moving object in each of the video frames included in the second video frame needs to be predicted separately.
And S13, cutting out a target area image containing the moving target from the second video frame according to the position prediction result.
In specific implementation, the corresponding local area of the moving object in the second video frame may be determined according to the corresponding position prediction result of the moving object in the second video frame; and detecting the moving target in the local area to determine a target area image containing the moving target, and cutting out the target area image containing the moving target from the second video frame. Here, the range of the local area covers a target area image including a moving target.
The moving target can be detected in a corresponding local area of the moving target in the second video frame by adopting the preset detection algorithm, and a target area image containing the moving target is determined.
In this embodiment, based on the position prediction result, only the target area image containing the moving target needs to be cut out from the local area of the second video frame corresponding to the moving target, so that the processing time of the system algorithm can be effectively reduced.
In some embodiments, a target area image containing a moving target cropped from a second video frame is associated with a timestamp of the second video frame.
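A minimal sketch of the local-area cropping described in step S13 is shown below, assuming the predicted position is given in pixels; the fixed half-window size is an illustrative assumption.

```python
import numpy as np

def crop_local_region(frame: np.ndarray, pred_x: float, pred_y: float,
                      half_w: int = 160, half_h: int = 160):
    """Cut a search window around the predicted position instead of
    scanning the whole frame; returns the crop and its top-left offset."""
    h, w = frame.shape[:2]
    x0, x1 = max(0, int(pred_x) - half_w), min(w, int(pred_x) + half_w)
    y0, y1 = max(0, int(pred_y) - half_h), min(h, int(pred_y) + half_h)
    return frame[y0:y1, x0:x1], (x0, y0)  # offset maps detections back to frame coords
```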
S14, separating a target image of the moving target from the target area image; the target image of the moving target is used for generating a time slicing special effect, and the time slicing special effect is used for presenting the motion states of the moving target at different time points on the same frame picture.
In a specific implementation, a deep learning algorithm may be used to extract contour information of the moving target; the effective picture data in the target area image is then separated according to this contour information, the transparency of each pixel in the second video frame is determined from the separation result, and the target image of the moving target is finally obtained.
In some embodiments, separating the target image of the moving target from the target area image may include:
the method comprises the steps of carrying out cutout processing on a moving object on an object region image by adopting a deep learning algorithm, wherein the background color in the object region image can comprise a plurality of colors.
In some embodiments, to further reduce the processing time of the system algorithm, a frame-extracting manner (processing once every preset interval frame number) may be adopted to separate the target image of the moving target from the plurality of target area images. The preset number of interval frames may be set according to actual application needs, which is not specifically limited in the embodiment of the present disclosure.
In this embodiment, the target image of the moving target is used to generate the time slice special effect, so that a series of instantaneous images of the moving target can be displayed on the picture through the time slice special effect, and the effect of motion visual persistence is visually displayed, so that the audience can visually see the ultrahigh-definition motion details at each moment in the same picture, thereby rapidly understanding the motion skills of the athlete and increasing the viewing interest of the audience on sports live broadcast.
The embodiment of the disclosure provides a video processing method, which predicts the position of a moving object in a second video frame after a current video frame according to the positions of the moving object in the current video frame and a first video frame before the current video frame, and cuts out an object area image containing the moving object from the second video frame according to a position prediction result, so that when cutting out, only the object area image containing the moving object needs to be cut out from a local area of the second video frame corresponding to the moving object based on the position prediction result, thereby reducing the calculation overhead of object area image detection in the video processing process and reducing the processing time of a system algorithm. Furthermore, the target image of the moving target is separated from the target area image to be used for generating the time slice special effect, the lossless time slice special effect video can be generated quickly, the time effect requirement of the sports live broadcast is met, and meanwhile the watching experience of a user of the sports live broadcast can be improved.
In one embodiment, as shown in fig. 2, before the step S11 is executed based on fig. 1, the method further includes:
s10, according to the video frame of the moving object appearing in the live video, determining the video clip of the moving object appearing in the preset area of the video frame as the target video clip.
The preset area of the video frame may be a pre-marked motion area. For example, for aerial freestyle skiing, the preset area of the video frame may cover the area between the take-off area and the landing stop area.
In the embodiment, the video segment of the moving target appearing in the preset area of the video frame is determined as the target video segment, so that the video frame in the target video segment is only required to be processed to obtain the target image of the moving target, and the finally generated time slice special effect can meet the watching requirement of a user in a more targeted manner.
In one embodiment, as shown in fig. 3, in the step S10, the determining, as the target video segment, a video segment in which the moving object appears in a preset area of a video frame according to the video frame in which the moving object appears in the live video, may include:
s101, determining a first candidate video frame in the live video; wherein the first candidate video frame comprises the moving object entering a preset starting area.
Here, the preset start area is the movement start area of the entire sports track, and correspondingly the preset end area is the movement end area of the entire sports track. For example, for aerial freestyle skiing, the preset start area may be the take-off area and the preset end area may be the landing stop area.
Specifically, trigger detection may be performed in a preset start area according to each video frame picture in the live video, and when it is detected that a moving object that meets preset information enters the preset start area, a video frame including the moving object that enters the preset start area is determined as a first candidate video frame.
In addition, to ensure accurate detection of the moving target in the preset starting area, a background difference method may be used in practice: the background changes significantly after the moving target enters the preset starting area, and when the foreground in the residual (difference) image reaches a preset proportion, a foreground box is output from the binarized image.
In addition, an erosion operation may be applied to reduce interference from object movement and noise in the background, further improving the accuracy of moving target detection.
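A hedged OpenCV sketch of this background-difference trigger with an erosion step follows; the binarization threshold and foreground-ratio value are assumptions, as the disclosure does not specify them.

```python
import cv2
import numpy as np

def entered_start_area(frame_gray: np.ndarray, background_gray: np.ndarray,
                       ratio_threshold: float = 0.02) -> bool:
    """Threshold the residual (difference) image, erode it to suppress
    noise, and trigger once the foreground reaches a preset proportion."""
    diff = cv2.absdiff(frame_gray, background_gray)
    _, binary = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    binary = cv2.erode(binary, np.ones((3, 3), np.uint8), iterations=1)
    return cv2.countNonZero(binary) / binary.size >= ratio_threshold
```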
S102, determining a second candidate video frame in a plurality of video frames after the first candidate video frame; the second candidate video frame comprises the moving target moving to a preset ending area;
specifically, triggering detection is carried out on a video frame after a first candidate video frame in a preset ending area, and when it is detected that an athlete according with preset information moves to the preset ending area, the video frame including the moving object entering the preset starting area is determined as a second candidate video frame.
S103, acquiring the target video clip according to the first candidate video frame and the second candidate video frame.
Specifically, the nth video frame after the second candidate video frame is determined as a third candidate video frame, and the target video segment is acquired based on a plurality of video frames located between the first candidate video frame and the third candidate video frame. Here, n may be a preset value, and n is a positive integer greater than or equal to 1. Or, the interval duration between the third candidate video frame and the second candidate video frame is set as a preset duration, where the preset duration may be set according to the actual application requirement, and is not specifically limited here.
In the embodiment of the disclosure, when a target video segment containing a moving target appearing in a preset area of a video frame is obtained from a live video, detection of the moving target does not need to be performed on each video frame in the live video, and only a first candidate video frame containing the moving target entering a preset starting area needs to be determined, a second candidate video frame containing the moving target moving to a preset ending area needs to be determined in a plurality of video frames after the first candidate video frame, and the target video segment is obtained according to the first candidate video frame and the second candidate video frame. Therefore, the calculation overhead of target area image detection in the video processing process can be reduced, so that lossless time slice special effects video can be generated quickly, the time efficiency requirement of sports live broadcast is met, and the watching experience of users of sports live broadcast can be improved.
In an embodiment, as shown in fig. 4, in the step S12, the predicting the position of the moving object in the second video frame according to the positions of the moving object in the current video frame and the first video frame to obtain the position prediction result may include:
s121, predicting the motion track of the motion target according to the positions of the motion target in the current video frame and the first video frame.
The predicted motion trajectory of the moving object refers to a moving trajectory of the moving object within a preset time period after the current time.
Specifically, according to the position of the moving object in the current video frame and the position in the first video frame, and in combination with the time stamps corresponding to the current video frame and each video frame in the first video frame, the moving direction and the moving speed of the moving object can be determined; and predicting the motion trail of the motion target according to the motion direction and the motion speed of the motion target.
For example, in a live scene of aerial freestyle skiing, the athlete's movement relative to the camera position is an oblique projectile motion, so the athlete's trajectory can be understood as a parabola deformed by the spatial projection captured by the camera. Using oblique projectile motion calculations, the athlete's direction and speed of movement can be obtained relatively accurately, and the motion trajectory of the moving target can then be predicted to obtain a series of positions of the athlete in screen space.
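For illustration, the direction and speed of movement could be estimated from the recent positions and timestamps by a least-squares fit, as in the following sketch; the names and units are assumptions.

```python
import numpy as np

def velocity_from_track(times, xs, ys):
    """Least-squares estimate of screen-space velocity from the recent
    track; returns speed (pixels/second) and direction (radians)."""
    vx = np.polyfit(times, xs, 1)[0]  # slope of x over time
    vy = np.polyfit(times, ys, 1)[0]  # slope of y over time
    return float(np.hypot(vx, vy)), float(np.arctan2(vy, vx))
```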
In some embodiments, to reduce the computational overhead, the predicted motion trajectory of the moving object may be a motion trajectory region corresponding to the moving object. The motion trail area corresponding to the motion target at least comprises: regions where moving objects may appear in the second video frame.
And S122, predicting the position of the moving target in the second video frame according to the motion track prediction result of the moving target to obtain a position prediction result.
Specifically, the motion trail of the moving object is predicted according to the positions of the moving object in the current video frame and the first video frame, and the position of the moving object in the second video frame is predicted according to the motion trail prediction result of the moving object and the timestamp corresponding to the second video frame.
In this way, the area where the moving target may appear in the second video frame can be determined quickly, so that subsequent detection of the moving target is performed only within that area, effectively narrowing the detection range and improving video processing efficiency.
In one embodiment, as shown in fig. 5, the cropping out the target area image containing the moving object from the second video frame according to the position prediction result in step S13 may include:
s131, determining a cutting frame of the moving object in the second video frame according to the position prediction result.
Specifically, according to the position prediction result, a local area corresponding to the moving object in the second video frame may be determined, the moving object may be detected for a picture included in the local area, and according to the detection result of the moving object, the outermost edge of the athlete and the equipment thereof may be determined as an effective picture cropping range.
S132, cutting out a target area image containing the moving target from the second video frame according to the cutting frame of the moving target in the second video frame.
In the embodiment of the disclosure, the cutting frame of the moving object in the second video frame is determined according to the position prediction result, and then the target area image containing the moving object is cut out based on the cutting frame, so that the detection range of the moving object can be effectively reduced, and the processing efficiency is improved.
In some embodiments, in order to prevent interfering persons or objects in the picture from affecting detection accuracy, the above step S13 may further include: removing interference by setting an exclusion region in advance, where the detection priority weight of the exclusion region is lower than that of a non-exclusion region. It will be appreciated that if the athlete and equipment enter such a region, they are processed normally; after the athlete and equipment leave, the region continues to be marked as excluded.
In one embodiment, as shown in fig. 6, the separating the target image of the moving target from the target area image in step S14 may include:
and S141, extracting the characteristic information of the moving target based on a preset deep learning model.
The deep learning model may include, but is not limited to: convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and the like.
The characteristic information of the moving object comprises: a first resolution feature and a second resolution feature; wherein the resolution of the second resolution feature is lower than the resolution of the first resolution feature.
Specifically, the feature information of the moving object may be extracted using an encoder network in a Multi-feature Fusion (Multi-feature Fusion) model.
S142, determining the foreground part of the target area image according to the characteristic information of the moving target.
Specifically, a decoder network in a multi-feature fusion model may be used to determine the foreground portion of the target area image according to the feature information of the moving target.
In this embodiment, the first resolution feature and the second resolution feature provided by the encoder network may be subjected to multi-feature fusion processing by a decoder network in the deep learning model, and different responses are output for a foreground region and a background region of the target region image.
The first resolution feature is extracted by shallow convolution kernels of the encoder network; it loses little image detail and contains a large amount of foreground edge feature information, which helps the decoder network restore a high-precision foreground edge contour. The second resolution feature is extracted by deep convolution kernels of the encoder network; although a large amount of image detail is lost, it contains rich foreground semantic information, which helps the decoder network distinguish foreground from background.
The decoder network finally fuses the feature information of the different layers: starting from the low-resolution feature map, it combines it with the high-resolution feature map, performs operations such as upsampling, gradually increases the resolution of the feature map, and finally recovers a high-precision semantic image, i.e., outputs the target image of the moving target.
In some embodiments, the training process of the multi-feature fusion model may include: foreground human body image frames with different postures and angles are obtained from a sample motion video, corresponding Mask (Mask) images are generated through image processing software to serve as training materials, multiple synthetic materials with different backgrounds are generated through a background synthesis method in the training process, pixel transformation and space transformation are added, and the generalization of a model and the anti-interference capability of the model to different background textures are improved.
S143, determining transparency information of a plurality of pixel points in the target area image according to the foreground part of the target area image.
Specifically, a transparency channel bit may be added for each pixel in the target area image; according to the foreground portion of the target area image, the transparency channel bits of all pixels outside the foreground portion are marked as transparent, while those of the foreground portion are marked as opaque.
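A minimal sketch of this transparency marking is given below, assuming the separation step yields a binary foreground mask; the RGBA layout and function name are illustrative choices.

```python
import numpy as np

def to_rgba(target_region_rgb: np.ndarray, foreground_mask: np.ndarray) -> np.ndarray:
    """Attach a transparency channel: pixels outside the foreground mask
    are marked fully transparent, foreground pixels fully opaque."""
    alpha = np.where(foreground_mask > 0, 255, 0).astype(np.uint8)
    return np.dstack([target_region_rgb, alpha])  # H x W x 4 RGBA image
```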
And S144, acquiring the target image of the moving target according to the foreground part of the target area image and the transparency information of the plurality of pixel points.
In the embodiment of the disclosure, the target image of the moving target is separated from the target area image through the deep learning image algorithm, and the picture information of the moving target can be extracted quickly and accurately, so that lossless time slicing special effect video can be generated quickly, and the time efficiency requirement of sports live broadcast is met.
In some embodiments, the above separation process may further include: using a maximum-connected-domain processing method to remove isolated regions in the alpha transparency channel image that are not connected to the foreground, which prevents useless isolated background information from remaining and reduces possible misjudgments during separation.
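One plausible reading of this maximum-connected-domain step is to keep only the largest foreground component of the alpha mask, as sketched below with OpenCV's connected-component analysis.

```python
import cv2
import numpy as np

def keep_largest_component(alpha_mask: np.ndarray) -> np.ndarray:
    """Drop isolated regions: keep only the largest connected foreground
    component of an 8-bit alpha mask (a sketch of the described step)."""
    n, labels, stats, _ = cv2.connectedComponentsWithStats(alpha_mask, connectivity=8)
    if n <= 1:
        return alpha_mask  # no foreground component at all
    largest = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))  # label 0 is background
    return np.where(labels == largest, alpha_mask, 0).astype(alpha_mask.dtype)
```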
In one embodiment, the method may further comprise:
acquiring pose data of the moving target according to lens parameters used for shooting the live video and a target image of the moving target; and performing association cache on the pose data of the moving target and the target image of the moving target.
The pose data of the moving target can comprise spatial position information of the moving target in a three-dimensional space, wherein the spatial position information comprises height information of the moving target relative to a takeoff plane.
Specifically, inverse distortion calculation may be performed on the three-dimensional spatial data of the target image using the known lens parameters to obtain approximate spatial position information, and the obtained spatial position information is stored in a data file in association with the target image.
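The inverse-distortion calculation is not spelled out in the disclosure; under the assumption of a standard pinhole-plus-distortion lens model, it might be sketched with OpenCV as follows.

```python
import cv2
import numpy as np

def undistort_pixel(px: float, py: float,
                    camera_matrix: np.ndarray, dist_coeffs: np.ndarray):
    """Map a distorted pixel back to its undistorted position using the
    known lens parameters (camera matrix and distortion coefficients)."""
    pts = np.array([[[px, py]]], dtype=np.float64)
    undist = cv2.undistortPoints(pts, camera_matrix, dist_coeffs, P=camera_matrix)
    return tuple(undist[0, 0])  # undistorted pixel coordinates
```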
In the embodiment of the disclosure, by performing association cache on the pose data of the moving target and the target image of the moving target, the time slice special effect of the moving target can be generated conveniently on the basis of the pose data of the moving target and the target image which are stored in association cache.
In one embodiment, the method may further comprise:
and synthesizing a plurality of target images of the moving target according to the plurality of pose data of the moving target to obtain a synthesized image, wherein the synthesized image is used for generating the time slice special effect.
Specifically, according to the pose data of the moving target, the target images of the moving target may be sequentially superimposed and fused in the Alpha Blend transparent channel fusion manner according to the timestamp order, and the composite image may be obtained.
During superposition and fusion, the athlete body pictures containing transparency channels in the buffer are fused frame by frame into an independent composite layer. Illustratively, the first target image [01] and the second target image [02] are fused in time order; after the composite layer is formed, the third target image [03] is superimposed and fused with the composite layer, and subsequent target images are processed in turn. When forming the composite layer, the transparency mask of each motion image is added to the transparency channel of the composite layer to form the composite layer's transparency mask, finally yielding the complete composite layer.
When the transparency channels of the fused images have an overlapping region, for example when the third target image [03] is combined with the composite layer formed from the second target image [02] and the first target image [01], the combination may be as follows: the transparency channel of [03] is added to that of the composite layer, with the result clamped to a maximum of 1, and non-overlapping pixels keep their original transparency values. For overlapping pixels, the R value is R = R[03] × Alpha[03] + R[composite] × (1 - Alpha[03]); the G and B channels are processed in the same way, and the resulting RGB values are finally written into the composite layer.
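The fusion rule above translates directly into the following sketch over floating-point RGBA layers; the numeric representation is an assumption, as the disclosure does not fix it.

```python
import numpy as np

def alpha_blend_over(composite_rgba: np.ndarray, layer_rgba: np.ndarray) -> np.ndarray:
    """Alpha Blend fusion as described: C = C_layer*A_layer +
    C_composite*(1 - A_layer) for each color channel, with alpha
    channels added and clamped to 1. Expects floats in [0, 1]."""
    a = layer_rgba[..., 3:4]
    out = composite_rgba.copy()
    out[..., :3] = layer_rgba[..., :3] * a + composite_rgba[..., :3] * (1.0 - a)
    out[..., 3:4] = np.clip(composite_rgba[..., 3:4] + a, 0.0, 1.0)
    return out
```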
In one embodiment, the live video is a video captured by a camera. Therefore, all the work flows can be completed by only using one camera as an input source, and complex multi-angle camera deployment is not needed.
In some embodiments, the method may further comprise:
the live video is captured by a camera.
In some embodiments, the method may further comprise:
and outputting a composite image synthesized based on a plurality of target images of the moving target.
Specifically, the composite image is output in a preset manner.
Wherein the preset mode comprises at least one of the following modes:
memory copy;
outputting a video file in a specified format;
and playing in a YUV signal or RGB signal mode.
Hereinafter, the video processing method provided by the present disclosure will be described with reference to specific embodiments.
In producing motion video special effects for a live scene, for example special effects for aerial skiing, a purely manual approach or image-assisted tools are mainly used: the athletes' jump frames in the camera footage are first matted frame by frame, and the layers are then merged back into the original video in order. Such methods have several significant disadvantages: 1) they rely heavily on manual processing, incurring unnecessary labor costs, with the result depending entirely on the operator's experience;
2) the processing flow is complicated and long, sometimes requiring several pieces of software; 3) production takes a long time, making it difficult to meet the timeliness requirements of live sports broadcasting; 4) the picture has poor interpretability, with no supporting motion data.
The embodiments of the present disclosure provide a video processing method that can be applied to high-quality professional live broadcast scenarios such as broadcast television coverage of aerial freestyle skiing. Using an artificial intelligence deep learning algorithm combined with video synthesis technology, it rapidly and automatically processes the video signals acquired by a camera, computes and separates frame by frame the pictures of a selected athlete and the skiing equipment worn by the athlete, and finally fuses them with the corresponding process video to render and output a video containing various information pictures. In this way, automatic, lossless production of ultra-high-definition live broadcast television for aerial freestyle skiing can be achieved, with characteristics such as unattended operation, fast generation of lossless time slice special effect videos, and lossless video processing, thereby enhancing the viewing experience of aerial freestyle skiing.
Fig. 7a is a schematic diagram of a hardware architecture of a video processing system according to an embodiment of the present disclosure, and as shown in fig. 7a, the hardware of the system mainly includes: the system comprises one or more devices such as a video camera with a video acquisition and capture function, a video processing workstation, a display (monitoring device), an internal or external storage unit (used for storing files), and the like.
Fig. 7b is a schematic diagram of a software architecture of a video processing system according to an embodiment of the present disclosure, and as shown in fig. 7b, software modules of the system mainly include: the system comprises a video acquisition module, a trigger module, an image algorithm module, a storage module, a motion analysis and calculation module and a special effect synthesis module. The software modules are all installed in independent or multiple workstations and automatically called according to preset logic.
In this embodiment, various computer vision algorithms are used to detect athletes in the video pictures, crop the effective pictures, eliminate the background, calculate picture motion information, and fuse the calculated pictures with the video, finally outputting a complete video stream and files. Because the complete processing of one frame may not finish before the next frame arrives, frames can be processed in parallel using multiple processes to improve efficiency. When the system is activated, each frame of picture passes completely through the whole flow. The process can be described as follows:
a. video capture
The video acquisition module captures video pictures, converts them into RGB format and stores them in a FIFO (First In, First Out) buffer created in memory; the buffer size may be preset or determined dynamically according to the available memory.
If HDR information exists, the video is marked as an HDR video, and the HDR information is written into the final result when the HDR video is output.
b. Detecting a trigger
i) Trigger detection is performed on each frame in the starting area. When an athlete matching the preset information is detected entering, this frame and all subsequent buffer queues are locked and no longer cleared, the processing system is marked as running, and the subsequent processing flow starts;
ii) While the processing system is running, trigger detection is performed on each frame in the ending area. When an athlete matching the preset information is detected entering, this frame and a buffer of preset duration are locked, after which newly acquired video data is no longer stored until the process terminates or the program restarts.
Fig. 7c is a schematic diagram of a video frame image for triggering detection provided by the embodiment of the disclosure, where an arrow a1 in fig. 7c points to a start area in the video frame, and an arrow a2 in fig. 7c points to an end area in the video frame.
Trigger detection is implemented by a detection-and-tracking algorithm comprising a detection model and a tracking model. While the processing system is running, the detection model detects human bodies in the starting area. When an image matching human body characteristics enters the detection area, the current target is tracked until the moving human body flies to the ending area, after which human body detection resumes in the starting area.
In addition, to ensure accurate human body detection in the starting area, a background difference method may be used: the background changes significantly after a moving human body enters the starting area, and when the foreground in the residual image reaches a certain proportion, a foreground box is output from the binarized image; an erosion operation may also be introduced to reduce interference from object movement and noise in the background.
c. Frame image preprocessing
For the real application scenario, the athlete's flight curve against the real background is predicted in advance to obtain local background images along the flight trajectory, which improves video processing efficiency.
i) While the processing system is running, limb detection of the athlete is performed on each frame; the outermost edges of the athlete and equipment determined by the detection algorithm define the effective picture cropping range, and the data within this range is extracted and passed to the subsequent picture processing flow;
ii) To prevent interfering persons or objects in the picture from affecting the system's judgment, interference is removed by defining an exclusion area. The detection priority weight of the exclusion area is low: if the athlete and equipment enter the area, they are processed normally, and after they leave, the area continues to be marked as excluded;
fig. 7d is a schematic diagram of a preprocessed video frame image according to an embodiment of the disclosure, where B in fig. 7d represents an effective cropping range in the video frame, and C in fig. 7d represents an excluded area in the video frame.
And iii) cutting each frame according to the collected mark sequence and storing the cut frames into the memory.
Fig. 7e is a schematic diagram of a cropped image provided in an embodiment of the present disclosure, where [01], [02], [03], [04], [05] in fig. 7e respectively show the sequence of the cropped image.
d. Deep learning algorithm computation
The cropped effective picture data is processed with a deep learning algorithm: a transparency channel bit is added for each pixel, data other than the athlete and equipment in the picture is marked with the computed transparency, and the result is written into memory frame by frame.
The deep learning algorithm adopts a multi-feature fusion model to extract feature information of a foreground human body in an image, so that the model outputs high response to a human body foreground region and low response to a background region are realized. The model training and optimization may specifically include: the method comprises the steps of obtaining foreground human body image frames with different postures and angles from a skiing motion video, generating corresponding mask images as training materials through image processing software, generating multiple synthetic materials with different backgrounds through a background synthesis method in the training process, and meanwhile adding pixel transformation and space transformation to improve the generalization of a model and the anti-interference capability of the model to different background textures.
In addition, in consideration of the situation that the deep learning algorithm may generate wrong judgment in actual use, a maximum connected domain processing module can be adopted in a post-processing part of the deep learning algorithm to remove an isolated region in which the background and the foreground are not connected in an alpha transparent channel image, so that useless isolated background information is prevented from being left.
e. Pose data calculation
Inverse distortion calculation is performed on the three-dimensional spatial data of the cropped effective picture data using the known lens parameters, yielding an approximate spatial position that is stored in a data file indexed by frame number for subsequent calculation.
For aerial freestyle skiing, a height gauge perpendicular to the ground can be calibrated on the plane of the takeoff platform to obtain the corresponding length of a single pixel and the 0-point pixel position at the bottom of the center of the takeoff platform. The height of the highest (or lowest) point of the athlete in each frame relative to the takeoff plane is obtained by subtracting the 0-point pixel position from the highest (or lowest) pixel position of the athlete's body box detected in that frame and multiplying by the pixel length; plotting these values continuously yields an actual height curve of the athlete's jump.
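The per-frame height then reduces to a one-line conversion, sketched below; the sign convention (image y coordinates increasing downward) is an assumption.

```python
def height_above_takeoff(point_px: float, zero_px: float, metres_per_pixel: float) -> float:
    """Height of a detected body-box point above the takeoff plane:
    pixel offset from the calibrated 0-point times the pixel length."""
    return (zero_px - point_px) * metres_per_pixel  # y grows downward in images
```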
f. Video composition
The video buffer area serves as the bottom layer; the pictures in the buffer are superimposed in time order at their corresponding position coordinates using Alpha Blend transparency channel fusion to generate a composite layer.
Wherein, the athlete body pictures containing the transparent channel in the buffer area are arranged to be fused frame by frame to form an independent composite layer.
Fig. 7f is a schematic diagram of frame-by-frame image fusion provided by the embodiment of the present disclosure. As shown in fig. 7f, image [02] is first fused with image [01]; after the composite layer is formed, image [03] is superimposed and fused with the composite layer, subsequent images are processed in turn, and the complete composite layer is finally formed and merged with the final video. The transparency mask of each image is added to the transparency channel of the composite layer to form the composite layer's transparency mask.
When the transparency channels of the fused images have an overlapping region, for example when image [03] is combined with the composite layer formed after [02], the combination is: the transparency channel of [03] is added to that of the composite layer and the result is clamped to a maximum of 1, with non-overlapping pixels keeping their original transparency values. The R value of overlapping pixels is R = R[03] × Alpha[03] + R[composite] × (1 - Alpha[03]); G and B are processed in the same way, and the resulting RGB values are finally written into the composite layer.
g. Output or write to disk
i) Frame-by-frame picture data containing the transparency channel in memory is saved directly, via memory copy, into a tga or 32-bit bmp file with a transparency channel;
ii) The first-in first-out cached frame data locked in memory is output as a video file in a specified format using the Windows system's built-in encoder or tools such as FFmpeg;
iii) After the flow is confirmed complete, the video data in memory is converted into YUV signals, or RGB signals are output directly, for video playback by calling the video output interface.
in summary, as shown in fig. 7g, the data and cache structure according to the embodiment of the disclosure can be described as follows:
1: acquiring an original video by an acquisition card;
2: a first-in first-out buffer for storing the video and writing it to disk, where the front and rear sections are preset extra video durations;
3: processing layer caches whose splicing order can be set as forward or reverse, allowing the front-to-back layer relationship between images to be adjusted for the final composite output;
4: transparent channel image buffer, storing each processed image, and being used for final synthesis.
The technical solution provided by the disclosure has the following advantages:
1) The system can run without operators and executes automatically without human intervention. Once the calculation conditions are set, a series of actions including acquisition, triggering, calculation, fusion rendering, file storage and lossless video output is executed automatically. Besides improving the efficiency of the whole production process, the video processing system provided by the disclosure directly saves labor costs and avoids unnecessary cost losses.
2) The system composition is simple: the whole solution can be deployed with a single high-performance workstation equipped with a high-definition display, a video capture card, and input devices such as a keyboard and mouse. Once deployed, the workflow is simple and clear; no switching between multiple pieces of software is needed, all work is completed by setting parameters within the same software interface, and broadcasting accidents caused by improper manual operation are unlikely.
3) Image recognition accuracy is high and processing time is short: the AI motion analysis and video processing for one aerial-skills jump can be completed within seconds after the athlete finishes the movement, and uncompressed video source streams and lossless original video files are output directly, so the method has the conditions and capability to be applied to live sports broadcasting.
4) The video processing pipeline adopts layered data processing, providing ample room for extension and compatibility; each logic layer and presentation layer can be combined arbitrarily, greatly enriching the visual effect of the broadcast.
5) The embodiment of the disclosure can complete the entire workflow using only one camera as the input source, without complex multi-angle camera deployment.
Fig. 8 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present disclosure, and as shown in fig. 8, the video processing apparatus may include:
a detection module 101, configured to detect positions of a moving object in a current video frame and a first video frame; wherein the first video frame comprises: a video frame preceding the current video frame in the target video segment; the target video clip is a video clip containing the moving target in a live video;
the prediction module 102 is configured to predict a position of the moving object in a second video frame according to positions of the moving object in the current video frame and the first video frame, so as to obtain a position prediction result; wherein the second video frame comprises: a video frame subsequent to the current video frame in the target video segment;
a cropping module 103, configured to cut out a target area image including the moving object from the second video frame according to the position prediction result;
a separation module 104, configured to separate a target image of the moving target from the target area image; the target image of the moving target is used for generating a time slicing special effect, and the time slicing special effect is used for presenting the moving states of the moving target at different time points on the same frame of picture.
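To make the division of labor among the four modules above concrete, here is a minimal per-frame skeleton; the one-line stubs are hypothetical stand-ins, not the disclosure's actual detection, prediction, cropping and matting models.

def detect(frame): return (0.0, 0.0)                # detection module stand-in
def predict_position(history): return history[-1]  # prediction module stand-in
def crop(frame, position): return frame             # cropping module stand-in
def matte(region): return region                    # separation module stand-in

def process_clip(frames):
    history, cutouts = [], []
    for i, frame in enumerate(frames):
        history.append(detect(frame))                # positions so far
        if i + 1 < len(frames):                      # a "second" frame exists
            predicted = predict_position(history)    # position prediction result
            region = crop(frames[i + 1], predicted)  # target area image
            cutouts.append(matte(region))            # target image with alpha
    return cutouts                                   # inputs for the time-slice effect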
In one embodiment, the apparatus further comprises:
and the determining module is used for determining a video segment of the moving target appearing in a preset area of the video frame as the target video segment according to the video frame of the moving target appearing in the live video.
In one embodiment, the determining module is specifically configured to:
determining a first candidate video frame in the live video; the first candidate video frame comprises the moving target entering a preset starting area;
determining a second candidate video frame in a plurality of video frames following the first candidate video frame; the second candidate video frame comprises the moving target moving to a preset ending area;
and acquiring the target video clip according to the first candidate video frame and the second candidate video frame.
In one embodiment, the prediction module 102 is specifically configured to:
predicting the motion track of the motion target according to the positions of the motion target in the current video frame and the first video frame;
and predicting the position of the moving target in the second video frame according to the motion track prediction result of the moving target to obtain a position prediction result.
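One simple realization of this prediction, under a hedged near-constant-velocity assumption, is to extrapolate the next center position linearly from the last two observed positions.

def predict_next(positions):
    """positions: (x, y) centres from the earlier frames and the current
    frame, in temporal order."""
    (x1, y1), (x2, y2) = positions[-2], positions[-1]
    return (2 * x2 - x1, 2 * y2 - y1)   # current position plus last displacement

assert predict_next([(100, 400), (130, 360)]) == (160, 320)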
In one embodiment, the cropping module 103 is specifically configured to:
determining a cutting frame of the moving object in the second video frame according to the position prediction result;
and cutting out a target area image containing the moving target from the second video frame according to the cutting frame of the moving target in the second video frame.
In one embodiment, the separation module 104 is specifically configured to:
extracting characteristic information of the moving target based on a preset deep learning model;
determining a foreground part of the target area image according to the characteristic information of the moving target;
determining transparency information of a plurality of pixel points in the target area image according to the foreground part of the target area image;
and acquiring a target image of the moving target according to the foreground part of the target area image and the transparency information of the plurality of pixel points.
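A sketch of this separation step, assuming some preset deep-learning model has already produced a per-pixel foreground probability map; fg_prob is a hypothetical stand-in for that model's output.

import numpy as np

def separate_target(region_bgr: np.ndarray, fg_prob: np.ndarray) -> np.ndarray:
    """region_bgr: HxWx3 uint8 cropped target area; fg_prob: HxW foreground
    probability in [0, 1]. Returns an HxWx4 BGRA target image."""
    alpha = (fg_prob * 255).astype(np.uint8)   # transparency per pixel
    bgra = np.dstack([region_bgr, alpha])      # attach the alpha channel
    bgra[alpha == 0, :3] = 0                   # zero out pure background pixels
    return bgra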
In one embodiment, the apparatus further comprises:
the acquisition module is used for acquiring the pose data of the moving target according to the lens parameters used for shooting the live video and the target image of the moving target;
and the cache module is used for performing associated cache on the pose data of the moving target and the target image of the moving target.
In one embodiment, the apparatus further comprises:
and the synthesis module is used for synthesizing a plurality of target images of the moving target according to a plurality of pose data of the moving target to obtain a synthesized image, wherein the synthesized image is used for generating the time slice special effect.
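A sketch of this synthesis step, reducing each pose datum to a top-left pixel offset for illustration (an assumption, not the disclosure's pose format) and reusing the alpha-over rule from the composition section.

import numpy as np

def synthesize(background_bgr, cutouts):
    """cutouts: list of (bgra_image, (x, y)) in temporal order; each cutout
    is assumed to fit entirely inside the background frame."""
    canvas = background_bgr.astype(np.float32) / 255.0
    for bgra, (x, y) in cutouts:
        h, w = bgra.shape[:2]
        roi = canvas[y:y + h, x:x + w]
        a = bgra[..., 3:4].astype(np.float32) / 255.0
        rgb = bgra[..., :3].astype(np.float32) / 255.0
        roi[:] = rgb * a + roi * (1.0 - a)     # alpha-over, one image at a time
    return (canvas * 255).astype(np.uint8)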
In one embodiment, the live video is a video captured by a camera.
In some embodiments, the apparatus may further comprise:
and the acquisition module is used for acquiring the live broadcast video through a camera.
In some embodiments, the apparatus may further comprise:
and the output module is used for outputting a composite image synthesized by a plurality of target images based on the moving target.
With regard to the video processing apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the video processing method, and will not be elaborated here.
In an exemplary embodiment, the detection module 101, the prediction module 102, the cropping module 103, the separation module 104, and the like may be implemented by one or more Central Processing Units (CPUs), Graphics Processing Units (GPUs), Baseband Processors (BPs), Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), general-purpose processors, controllers, Micro Controller Units (MCUs), microprocessors, or other electronic components, for performing the aforementioned video processing method.
Fig. 9 is a block diagram illustrating the structure of a computer device, which may be a server, according to an embodiment of the present disclosure. The computer device 800 shown in fig. 9 includes: at least one processor 801, a memory 802, and at least one network interface 803. The various components in the computer device 800 are coupled together by a bus system 804, which is understood to enable connection and communication among these components. In addition to a data bus, the bus system 804 includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, the various buses are all labeled as bus system 804 in fig. 9.
It will be appreciated that the memory 802 can be volatile memory, non-volatile memory, or a combination of both.
The memory 802 in the disclosed embodiments is used to store various types of data to support the operation of the computer device 800. Examples of such data include any computer program for operating on the computer device 800, such as the executable program 8021; a program implementing the method of the embodiments of the present disclosure may be contained in the executable program 8021.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions is also provided, such as the memory 802 comprising instructions, executable by the processor 801 of the computer device 800 to perform the video processing method described above. For example, the non-transitory computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
A non-transitory computer-readable storage medium having stored therein computer-executable instructions that, when executed by a processor, perform the steps of:
detecting the positions of the moving target in the current video frame and the first video frame; wherein the first video frame comprises: a video frame preceding the current video frame in the target video segment; the target video clip is a video clip containing the moving target in a live video;
predicting the position of the moving target in a second video frame according to the positions of the moving target in the current video frame and the first video frame to obtain a position prediction result; wherein the second video frame comprises: a video frame subsequent to the current video frame in the target video segment;
cutting out a target area image containing the moving target from the second video frame according to the position prediction result;
separating a target image of the moving target from the target area image; the target image of the moving target is used for generating a time slice special effect, wherein the time slice special effect is used for presenting the moving states of the moving target at different time points on the same frame of picture.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (12)

1. A method of video processing, the method comprising:
detecting the positions of the moving target in the current video frame and the first video frame; wherein the first video frame comprises: a video frame preceding the current video frame in the target video segment; the target video clip is a video clip containing the moving target in a live video;
predicting the position of the moving target in a second video frame according to the positions of the moving target in the current video frame and the first video frame to obtain a position prediction result; wherein the second video frame comprises: a video frame subsequent to the current video frame in the target video segment;
cutting out a target area image containing the moving target from the second video frame according to the position prediction result;
separating a target image of the moving target from the target area image; the target image of the moving target is used for generating a time slice special effect, wherein the time slice special effect is used for presenting the moving states of the moving target at different time points on the same frame of picture.
2. The method of claim 1, wherein the step of detecting the position of the moving object in the current video frame and the first video frame is preceded by the step of:
and determining a video clip of the moving target appearing in a preset area of the video frame as the target video clip according to the video frame of the moving target appearing in the live video.
3. The method according to claim 2, wherein the determining, as the target video segment, a video segment in which the moving object appears in a preset area of a video frame according to a video frame in which the moving object appears in the live video comprises:
determining a first candidate video frame in the live video; the first candidate video frame comprises the moving target entering a preset starting area;
determining a second candidate video frame in a plurality of video frames subsequent to the first candidate video frame; the second candidate video frame comprises the moving target moving to a preset ending area;
and acquiring the target video clip according to the first candidate video frame and the second candidate video frame.
4. The method of claim 1, wherein predicting the position of the moving object in the second video frame according to the positions of the moving object in the current video frame and the first video frame to obtain a position prediction result comprises:
predicting the motion track of the motion target according to the positions of the motion target in the current video frame and the first video frame;
and predicting the position of the moving target in the second video frame according to the motion track prediction result of the moving target to obtain a position prediction result.
5. The method according to claim 1, wherein said cropping out an image of a target area containing the moving object from the second video frame according to the position prediction result comprises:
determining a cutting frame of the moving object in the second video frame according to the position prediction result;
and cutting out a target area image containing the moving target from the second video frame according to the cutting frame of the moving target in the second video frame.
6. The method of claim 1, wherein separating the target image of the moving target from the target area image comprises:
extracting characteristic information of the moving target based on a preset deep learning model;
determining a foreground part of the target area image according to the characteristic information of the moving target;
determining transparency information of a plurality of pixel points in the target area image according to the foreground part of the target area image;
and acquiring a target image of the moving target according to the foreground part of the target area image and the transparency information of the plurality of pixel points.
7. The method of claim 1, further comprising:
acquiring pose data of the moving target according to lens parameters used for shooting the live video and a target image of the moving target;
and performing association cache on the pose data of the moving target and the target image of the moving target.
8. The method of claim 7, further comprising:
and synthesizing a plurality of target images of the moving target according to the plurality of pose data of the moving target to obtain a synthesized image, wherein the synthesized image is used for generating the time slice special effect.
9. The method of any one of claims 1 to 8, wherein the live video is a video captured by a camera.
10. A video processing apparatus, characterized in that the apparatus comprises:
the detection module is used for detecting the positions of the moving target in the current video frame and the first video frame; wherein the first video frame comprises: a video frame in the target video clip prior to the current video frame; the target video clip is a video clip containing the moving target in a live video;
the prediction module is used for predicting the position of the moving target in a second video frame according to the positions of the moving target in the current video frame and the first video frame to obtain a position prediction result; wherein the second video frame comprises: a video frame subsequent to the current video frame in the target video segment;
the cropping module is used for cutting out a target area image containing the moving target from the second video frame according to the position prediction result;
the separation module is used for separating a target image of the moving target from the target area image; the target image of the moving target is used for generating a time slice special effect, wherein the time slice special effect is used for presenting the moving states of the moving target at different time points on the same frame of picture.
11. A computer device, comprising:
a processor and a memory for storing executable instructions operable on the processor, wherein:
the processor is configured to execute the executable instructions, and the executable instructions perform the steps of the video processing method according to any one of claims 1 to 9.
12. A non-transitory computer-readable storage medium having stored thereon computer-executable instructions which, when executed by a processor, implement the steps in the video processing method according to any one of claims 1 to 9.
CN202210645927.5A 2022-06-08 2022-06-08 Video processing method, device and storage medium Pending CN115037992A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210645927.5A CN115037992A (en) 2022-06-08 2022-06-08 Video processing method, device and storage medium


Publications (1)

Publication Number Publication Date
CN115037992A true CN115037992A (en) 2022-09-09

Family

ID=83123270

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210645927.5A Pending CN115037992A (en) 2022-06-08 2022-06-08 Video processing method, device and storage medium

Country Status (1)

Country Link
CN (1) CN115037992A (en)

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008219489A (en) * 2007-03-05 2008-09-18 Nippon Hoso Kyokai <Nhk> Trajectory image synthesizer of video object and program therefor
JP2009110536A (en) * 2004-05-19 2009-05-21 Sony Computer Entertainment Inc Image frame processing method and device, rendering processor and moving image display method
JP2009163639A (en) * 2008-01-09 2009-07-23 Nippon Hoso Kyokai <Nhk> Object trajectory identification device, object trajectory identification method, and object trajectory identification program
CN104243819A (en) * 2014-08-29 2014-12-24 小米科技有限责任公司 Photo acquiring method and device
US20170040036A1 (en) * 2014-01-14 2017-02-09 Hanwha Techwin Co., Ltd. Summary image browsing system and method
CN109922294A (en) * 2019-01-31 2019-06-21 维沃移动通信有限公司 A kind of method for processing video frequency and mobile terminal
CN110135314A (en) * 2019-05-07 2019-08-16 电子科技大学 A kind of multi-object tracking method based on depth Trajectory prediction
WO2020073860A1 (en) * 2018-10-08 2020-04-16 传线网络科技(上海)有限公司 Video cropping method and device
WO2020115520A1 (en) * 2018-12-02 2020-06-11 Playsight Interactive Ltd. Ball tracking in sport events
CN112907622A (en) * 2021-01-20 2021-06-04 厦门市七星通联科技有限公司 Method, device, equipment and storage medium for identifying track of target object in video
CN112949401A (en) * 2021-02-01 2021-06-11 浙江大华技术股份有限公司 Image analysis method, device, equipment and computer storage medium
CN113507598A (en) * 2021-07-09 2021-10-15 Oppo广东移动通信有限公司 Video picture display method, device, terminal and storage medium
WO2021238325A1 (en) * 2020-05-29 2021-12-02 华为技术有限公司 Image processing method and apparatus
CN113950611A (en) * 2020-01-31 2022-01-18 格步计程车控股私人有限公司 Method and data processing system for predicting road properties
CN114494357A (en) * 2022-04-07 2022-05-13 长沙海信智能系统研究院有限公司 Target tracking method, device, equipment, readable storage medium and program product thereof
CN114584681A (en) * 2020-11-30 2022-06-03 北京市商汤科技开发有限公司 Target object motion display method and device, electronic equipment and storage medium
CN114581489A (en) * 2022-03-22 2022-06-03 浙江工业大学 Video target motion trajectory prediction method based on deep learning


Similar Documents

Publication Publication Date Title
JP4739520B2 (en) Method and apparatus for synthesizing video sequence with spatio-temporal alignment
US7042493B2 (en) Automated stroboscoping of video sequences
US10515471B2 (en) Apparatus and method for generating best-view image centered on object of interest in multiple camera images
US10771760B2 (en) Information processing device, control method of information processing device, and storage medium
US20180227482A1 (en) Scene-aware selection of filters and effects for visual digital media content
JP2004500756A (en) Coordination and composition of video sequences with space-time normalization
CA2921264C (en) A method and system for producing a video production
Lu et al. Identification and tracking of players in sport videos
CN110049345A (en) A kind of multiple video streams director method and instructor in broadcasting's processing system
KR102239134B1 (en) Broadcast system for provides athletic video taken with VR cameras attached to drones
CN113812139A (en) Image processing apparatus, image processing method, and program
CN111741325A (en) Video playing method and device, electronic equipment and computer readable storage medium
EP3588448B1 (en) Method and system for displaying a virtual object
CN115037992A (en) Video processing method, device and storage medium
EP1289282B1 (en) Video sequence automatic production method and system
CN114302234B (en) Quick packaging method for air skills
CN112116634A (en) Multi-target tracking method of semi-online machine
Kasuya et al. Real-time soccer player tracking method by utilizing shadow regions
JP2825863B2 (en) Moving object detection device
CN114463664B (en) Novel ice ball tracking method for ice ball movement
AU702724B1 (en) Image manipulation apparatus
EP4300978A1 (en) Automatic alignment of video streams
AU2015101938A4 (en) Assessment of sporting incidents
AU2003268578B2 (en) Method and System for Combining Video Sequences With Spatio-temporal Alignment
Ishii et al. Image Analysis Technologies to Realize “Dream Arenas”

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220909