Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The video image processing method provided by the present application can be applied to the application environment shown in fig. 1, in which a terminal 101 communicates with a server 102 via a network. Specifically, the video image processing method may be applied to the server 102, or to the terminal 101 (e.g., in a video software application installed on the terminal 101). The terminal 101 may be, but is not limited to, a personal computer, a notebook computer, a smart phone, a tablet computer, or a portable wearable device, and the server 102 may be implemented by an independent server or a server cluster formed by a plurality of servers. The following takes the application of the above method to a video software application installed on the terminal 101 as an example:
For example, the video software on the terminal 101 may obtain a video file from the server 102 through the network and play it. During playback of the video file, the terminal may obtain the current frame video image in real time, perform image segmentation on it to obtain a first mask image corresponding to the current frame video image, and store the first mask image accordingly. Meanwhile, the terminal may determine historical motion information of the current frame video image from the current frame video image and the previous frame video image; for example, the historical motion information may be optical flow information. Then, the terminal can obtain a third mask image of the current frame video image according to the second mask image corresponding to the previous frame video image and the historical motion information of the current frame video image.
Furthermore, the terminal can calculate a fusion weight according to the historical motion information of the current frame video image, and perform weighted fusion of the first mask image and the third mask image according to the fusion weight to obtain a fusion mask image of the current frame video image, thereby realizing image segmentation of the current frame video image. According to the video image processing method in the embodiments of the present application, the fusion weight is calculated from the historical motion information of the current frame video image and used to fuse the first mask image and the third mask image, so that the jitter and delay caused by inconsistency between preceding and following frames can be avoided, and the stability and smoothness of the video are improved. Meanwhile, by adopting this video image segmentation and fusion method, specific features in the video image can be accurately identified, so that the accuracy of motion tracking of those features can be improved.
In an embodiment, as shown in fig. 2, the video image processing method according to the embodiment of the present application is used for segmenting and fusing video images to improve stability and smoothness of the video. The method comprises the following steps:
S100, acquiring a current frame video image. Specifically, the current frame video image may be the current frame of a video file being played. Further, the video file may be an offline video file stored in a memory on the terminal, or an online video file acquired by the terminal from a server. For example, the video file may be an online video file acquired by a terminal (e.g., a mobile phone) from a server. In this case, when a user requests to play a specified video file through the video software on the mobile phone, the terminal transmits the video playing request to the server through the network, and the server returns a playing address and the like of the specified video file, so that the specified video file can be played on the mobile phone. During playback, the terminal can acquire the current frame video image of the specified video file in real time.
S200, performing image segmentation on the current frame video image to obtain a first mask image corresponding to the current frame video image. Specifically, a convolutional neural network may be used to perform image segmentation on the current frame video image. Further, the convolutional neural network in the embodiments of the present application may adopt a conventional CNN (Convolutional Neural Network), that is, in order to classify a pixel, an image block around that pixel is used as the input of the CNN for training and prediction. Optionally, the process of performing image segmentation on the current frame video image using the convolutional neural network includes identifying each object in the current image, classifying the identified objects, and finally performing a series of operations such as edge detection and pooling on the target object to obtain a mask image of the target object, so as to distinguish the target object from other objects in the current frame video image. Optionally, the convolutional neural network in the embodiments of the present application may also adopt an FCN (Fully Convolutional Network). The current frame video image before segmentation and the resulting first mask image can be seen in fig. 9.
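For illustration only, a minimal Python sketch of this segmentation step is given below; the helper name segment_frame, the stand-in model interface, and the target class label are assumptions made for the example rather than details specified by the present application:

```python
import numpy as np

def segment_frame(frame, model, target_class=1):
    """Segment one video frame and return the first mask image.

    `model` is assumed to map an (H, W, 3) uint8 frame to per-pixel
    class scores of shape (H, W, num_classes); any CNN/FCN-style
    segmentation network could be substituted here.
    """
    scores = model(frame)                  # (H, W, C) per-pixel class scores
    labels = np.argmax(scores, axis=-1)    # per-pixel class label
    # Binary mask of the target object: 1.0 inside the object, 0.0 elsewhere.
    return (labels == target_class).astype(np.float32)
```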
S300, determining historical motion information of the current frame video image according to the current frame video image and the previous frame video image, and obtaining a third mask image corresponding to the current frame video image according to the second mask image corresponding to the previous frame video image and the historical motion information of the current frame video image. Specifically, in the embodiments of the present application, the second mask image corresponding to the previous frame video image may be a mask image calculated by the convolutional neural network, or a fusion mask image obtained by fusing a mask image calculated by the convolutional neural network with a mask image obtained from historical motion information. Furthermore, the historical motion information of the current frame video image represents the difference between the current frame video image and the previous frame video image, that is, the offset of each pixel in the current frame video image. Therefore, by combining the historical motion information of the current frame video image with the second mask image corresponding to the previous frame video image, inconsistency between the preceding and following frames, and hence image jitter, can be avoided. Alternatively, the historical motion information may be represented using optical flow information.
S400, calculating a fusion weight according to the historical motion information of the current frame video image, and fusing the first mask image and the third mask image according to the fusion weight to obtain a fusion mask image of the current frame video image. Optionally, a first weight corresponding to the first mask image and a second weight corresponding to the third mask image may be determined according to the fusion weight, so that the first mask image and the third mask image can be weighted and fused according to the first weight and the second weight to calculate the fusion mask image of the current frame video image, thereby realizing image segmentation of the current frame video image.
According to the video image processing method in the embodiments of the present application, the fusion weight is calculated from the historical motion information of the current frame video image and used to fuse the first mask image and the third mask image, so that the jitter and delay caused by inconsistency between preceding and following frames can be avoided, and the stability and smoothness of the video are improved. Meanwhile, by adopting this video image segmentation and fusion method, specific features in the video image can be accurately identified, so that the accuracy of motion tracking of those features can be improved.
In one embodiment, as shown in fig. 3, the step S300 may include:
S310, determining optical flow information of the current frame video image according to the current frame video image and the previous frame video image, wherein the optical flow information is used to represent the historical motion information of the current frame video image and comprises a horizontal pixel offset and a vertical pixel offset of each pixel in the current frame video image. Specifically, the optical flow information represents the correspondence between the current frame video image and the previous frame video image; that is, the historical motion information in the embodiments of the present application is obtained by an optical flow method. Optical flow is the "instantaneous velocity" of the pixel motion of a spatially moving object on the observation imaging plane. The optical flow method finds the correspondence between the current frame video image and the previous frame video image by using the temporal change of pixels in an image sequence and the correlation between adjacent frames, thereby calculating the motion information of an object between adjacent frames. The horizontal pixel offset may be the horizontal motion velocity value of a pixel, and the vertical pixel offset may be the vertical motion velocity value of a pixel. Alternatively, a sparse optical flow method or a dense optical flow method may be employed in the embodiments of the present application. In the embodiments of the present application, the optical flow information can be computed with the calcOpticalFlowFarneback interface provided by OpenCV (a cross-platform computer vision library released under the BSD license that runs on Linux, Windows, Android, and Mac OS operating systems).
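For illustration only, the following Python sketch computes the dense optical flow with OpenCV's calcOpticalFlowFarneback interface; the parameter values shown are typical values from OpenCV documentation examples, not values prescribed by the present application:

```python
import cv2

def estimate_flow(prev_gray, cur_gray):
    """Dense Farneback optical flow between two consecutive grayscale frames.

    Returns an (H, W, 2) float32 array in which flow[..., 0] is the
    horizontal pixel offset F(i, j, 0) and flow[..., 1] is the vertical
    pixel offset F(i, j, 1) of each pixel.
    """
    return cv2.calcOpticalFlowFarneback(
        prev_gray, cur_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
```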
S320, obtaining a third mask image corresponding to the current frame video image according to the second mask image corresponding to the previous frame video image and the optical flow information of the current frame video image. The third mask image is obtained by utilizing the historical motion information of the pixels in the image; because the dynamic motion characteristics of the pixels are taken into account, the video jitter caused by inconsistency between the preceding and following frames can be avoided.
In one embodiment, as shown in fig. 4, the step S320 may further include:
S321, superposing the horizontal pixel offset of each pixel of the current frame video image on the horizontal pixel of the second mask image, respectively, to calculate the horizontal pixel of the third mask image;
S322, superposing the vertical pixel offset of each pixel of the current frame video image on the vertical pixel of the second mask image, respectively, to calculate the vertical pixel of the third mask image.
Specifically, the third mask image satisfies Mt2(i, j) = Mt-1(i + F(i, j, 0), j + F(i, j, 1)), where Mt2(i, j) represents a pixel of the third mask image corresponding to the current frame video image, Mt-1(i, j) represents a pixel of the second mask image corresponding to the previous frame video image, F(i, j, 0) represents the horizontal pixel offset of the pixel, F(i, j, 1) represents the vertical pixel offset of the pixel, i represents the horizontal pixel coordinate, and j represents the vertical pixel coordinate.
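For illustration only, this warping step can be sketched in Python using OpenCV's remap function, which samples the previous mask at the offset coordinates; the helper name warp_previous_mask and the border handling are assumptions made for the example:

```python
import cv2
import numpy as np

def warp_previous_mask(prev_mask, flow):
    """Warp the second mask image by the optical flow to obtain the third
    mask image: Mt2(i, j) = Mt-1(i + F(i, j, 0), j + F(i, j, 1))."""
    h, w = prev_mask.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)  # i + F(i, j, 0)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)  # j + F(i, j, 1)
    # remap samples prev_mask at the shifted coordinates with bilinear
    # interpolation; coordinates falling outside the image reuse the
    # nearest border value.
    return cv2.remap(prev_mask, map_x, map_y,
                     interpolation=cv2.INTER_LINEAR,
                     borderMode=cv2.BORDER_REPLICATE)
```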
In one embodiment, as shown in fig. 5, the step S400 may include:
S410, calculating a first reference value according to the optical flow information of the current frame video image, a preset first parameter, and a preset second parameter. Specifically, the optical flow information of the current frame video image is obtained from the horizontal pixel offset and the vertical pixel offset of each pixel, and the first reference value is then calculated from the optical flow information of the current frame video image, the preset first parameter, and the preset second parameter. Alternatively, the optical flow information of the current frame video image is F(i, j) = F(i, j, 0) × F(i, j, 0) + F(i, j, 1) × F(i, j, 1), where F(i, j, 0) represents the horizontal pixel offset of the current pixel, F(i, j, 1) represents the vertical pixel offset of the current pixel, i represents the horizontal pixel coordinate, and j represents the vertical pixel coordinate. The horizontal pixel offset may be the horizontal motion velocity value of the pixel, and the vertical pixel offset may be the vertical motion velocity value of the pixel.
Further, the optical flow information of the current frame video image is multiplied by the first parameter, and an exponential operation is performed with the natural constant e as the base and this product as the exponent to obtain a third parameter; the first reference value is then calculated from the third parameter and the preset second parameter. Specifically, the reciprocal of the exponential operation result is multiplied by the preset second parameter, and the product is subtracted from 1 to obtain the first reference value.
S420, comparing the first reference value with a preset second reference value, and taking the minimum of the first reference value and the second reference value as the fusion weight of the current pixel, wherein the first parameter and the second parameter are constants, and both the first reference value and the second reference value are greater than zero and less than 1. According to the video image processing method in the embodiments of the present application, the fusion weight is calculated from the historical motion information, and the first mask image and the third mask image are fused using the fusion weight, so that the jitter and delay caused by inconsistency between preceding and following frames can be avoided, and the stability and smoothness of the video are improved. Meanwhile, by adopting this video image segmentation and fusion method, specific features in the video image can be accurately identified, so that the accuracy of motion tracking of those features can be improved.
Alternatively, the fusion weight W(i, j) = min(r1, 1 - (1/e^(r2 × F(i, j))) × r3), where r1 denotes the second reference value, r2 denotes the first parameter, r3 denotes the second parameter, 1 - (1/e^(r2 × F(i, j))) × r3 denotes the first reference value, and F(i, j) denotes the optical flow information, i.e., the motion velocity of the pixel. Further, the value range of the second reference value may be [0.8, 0.95], the value range of the first parameter may be [4, 6], and the value range of the second parameter may be [0.6, 0.9].
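For illustration only, a minimal Python sketch of this weight calculation follows; the helper name fusion_weight is an assumption, and its default parameter values match the worked example given later in this section (r1 = 0.9, r2 = 5, r3 = 0.8):

```python
import numpy as np

def fusion_weight(flow, r1=0.9, r2=5.0, r3=0.8):
    """Per-pixel fusion weight W(i, j) = min(r1, 1 - (1/e^(r2*F)) * r3).

    F(i, j) is the optical flow information, i.e. the squared magnitude
    of the per-pixel flow vector.
    """
    F = flow[..., 0] ** 2 + flow[..., 1] ** 2
    first_reference = 1.0 - r3 * np.exp(-r2 * F)   # 1 - r3 / e^(r2*F)
    return np.minimum(r1, first_reference)         # elementwise min with r1
```

Note that for a static pixel (F close to 0) the weight approaches 1 - r3, so the warped third mask dominates, while for fast-moving pixels the weight saturates at r1 and the freshly segmented first mask dominates.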
In one embodiment, the step S400 may further include:
taking the fusion weight as a first weight corresponding to the first mask image, taking the difference between a preset total weight and the fusion weight as a second weight corresponding to the third mask image, and performing weighted fusion on the first mask image and the third mask image according to the first weight and the second weight. Specifically, the first weight is equal to the fusion weight W(i, j), and the second weight is 1 - W(i, j).
At this time, the fusion mask image M(i, j) = W(i, j) × Mt1(i, j) + (1 - W(i, j)) × Mt2(i, j), where Mt1(i, j) represents the first mask image corresponding to the current frame video image, and Mt2(i, j) represents the third mask image corresponding to the current frame video image.
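For illustration only, the weighted fusion itself reduces to a single element-wise operation on the mask arrays (the helper name fuse_masks is an assumption made for the example):

```python
def fuse_masks(first_mask, third_mask, W):
    # M(i, j) = W(i, j) * Mt1(i, j) + (1 - W(i, j)) * Mt2(i, j)
    return W * first_mask + (1.0 - W) * third_mask
```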
For example, when the second reference value r1 is 0.9, the first parameter r2 is 5, and the second parameter r3 is 0.8, the first reference value is 1 - (1/e^(5 × F(i, j))) × 0.8. At this time, it is necessary to compare the first reference value with the second reference value 0.9; if the first reference value is less than 0.9, the first reference value is used as the fusion weight, and the fusion mask image is calculated according to the above weighted calculation.
If the first reference value is not less than 0.9, the second reference value is used as the fusion weight, and the fusion mask image is calculated according to the above weighted calculation. At this time, the first weight corresponding to the first mask image is 0.9, the second weight corresponding to the third mask image is 0.1, and the fusion mask image M(i, j) = 0.9 × Mt1(i, j) + 0.1 × Mt2(i, j).
When the second reference value r1 is equal to 1 and the second parameter r3 is equal to 0, the first reference value 1 - (1/e^(r2 × F(i, j))) × r3 is equal to 1, so the fusion weight W(i, j) is equal to 1. In this case, the fusion mask image is the first mask image, which is equivalent to not performing smoothing on the first mask image obtained by the convolution calculation.
When the second reference value r1 is 0.5, the first parameter r2 is 0, and the second parameter r3 is 0.5, the first reference value 1 - (1/e^(r2 × F(i, j))) × r3 is 0.5, so the fusion weight W(i, j) is 0.5 and the fusion mask image M(i, j) = 0.5 × Mt1(i, j) + 0.5 × Mt2(i, j), which corresponds to average smoothing.
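Putting the above steps together, the following sketch shows, for illustration only, one per-frame iteration of the method, reusing the hypothetical helpers from the earlier examples:

```python
import cv2

def process_frame(frame, prev_frame, prev_mask, model):
    """One iteration: segment, estimate flow, warp the previous mask, fuse."""
    first_mask = segment_frame(frame, model)              # S200: first mask image
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    cur_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    flow = estimate_flow(prev_gray, cur_gray)             # S310: optical flow
    third_mask = warp_previous_mask(prev_mask, flow)      # S320: third mask image
    W = fusion_weight(flow)                               # S410/S420: fusion weight
    fused = fuse_masks(first_mask, third_mask, W)         # S400: fusion mask image
    return fused  # becomes the "second mask image" for the next frame
```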
It should be understood that, although the various steps in the flowcharts of fig. 2-5 are shown in sequence as indicated by the arrows, these steps are not necessarily performed in that sequence. Unless explicitly stated otherwise herein, the order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in fig. 2-5 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and the order of performance of these sub-steps or stages is not necessarily sequential; they may be performed in turn or in alternation with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 6, the present application further provides a video image processing apparatus, which includes an obtaining module 100, a first segmentation module 200, a smoothing module 300, and a fusion module 400. The obtaining module 100 is configured to obtain a current frame video image. The first segmentation module 200 is configured to perform image segmentation on the current frame video image to obtain a first mask image corresponding to the current frame video image. Specifically, in the embodiments of the present application, a conventional CNN (Convolutional Neural Network) may be used as the convolutional neural network, that is, in order to classify one pixel, an image block around that pixel is used as the input of the CNN for training and prediction; for a specific implementation, reference may be made to methods of image segmentation with a conventional CNN. Optionally, the convolutional neural network in the embodiments of the present application may also adopt an FCN (Fully Convolutional Network). The current frame video image before segmentation and the resulting first mask image can be seen in fig. 9.
The smoothing module 300 is configured to determine historical motion information of the current frame video image according to the current frame video image and the previous frame video image, and obtain a third mask image corresponding to the current frame video image according to the second mask image corresponding to the previous frame video image and the historical motion information of the current frame video image. Specifically, in the embodiments of the present application, the second mask image corresponding to the previous frame video image may be a mask image calculated by the convolutional neural network, or a fusion mask image obtained by fusing a mask image calculated by the convolutional neural network with a mask image obtained from historical motion information. Furthermore, the historical motion information of the current frame video image represents the difference between the current frame video image and the previous frame video image, so combining the historical motion information of the current frame video image with the second mask image corresponding to the previous frame video image avoids inconsistency between the preceding and following frames, and hence image jitter. Alternatively, the historical motion information may be represented using optical flow information.
The fusion module 400 is configured to calculate a fusion weight according to the historical motion information, and fuse the first mask image and the third mask image according to the fusion weight to obtain a fusion mask image of the current frame video image. In the embodiments of the present application, a first weight corresponding to the first mask image and a second weight corresponding to the third mask image can be determined according to the fusion weight, so that the first mask image and the third mask image can be weighted and fused according to the first weight and the second weight to calculate the fusion mask image of the current frame video image, thereby realizing image segmentation of the current frame video image.
According to the video image processing apparatus in the embodiments of the present application, the fusion weight is calculated from the historical motion information of the current frame video image and used to fuse the first mask image and the third mask image, so that the jitter and delay caused by inconsistency between preceding and following frames can be avoided, and the stability and smoothness of the video are improved. Meanwhile, by adopting this video image segmentation and fusion method, specific features in the video image can be accurately identified, so that the accuracy of motion tracking of those features can be improved.
In one embodiment, as shown in fig. 7, the smoothing module 300 may include a speed calculation unit 310 and a smoothing unit 320. The speed calculation unit 310 is configured to calculate and obtain optical flow information according to the current frame video image and a previous frame video image of the current frame video image, where the optical flow information is used to represent historical motion information of the current frame video image, and the optical flow information includes a horizontal pixel offset and a vertical pixel offset of each pixel in the current frame video image.
In an embodiment, the smoothing unit 320 is configured to separately superimpose the horizontal pixel offset of each pixel of the current frame video image and the horizontal pixel of the second mask image, and calculate to obtain the horizontal pixel of the third mask image; and respectively superposing the vertical pixel offset of each pixel of the current frame video image and the vertical pixel of the second mask image, and calculating to obtain the vertical pixel of the third mask image.
In one embodiment, as shown in fig. 7, the fusion module 400 further includes a weight calculation unit 410 and a fusion unit 420. The weight calculation unit 410 is configured to calculate and obtain a first reference value according to the optical flow information, a preset first parameter and a second parameter; and comparing the first reference value with a preset second reference value, and taking the minimum value of the first reference value and the second reference value as a fusion weight, wherein the first parameter and the second parameter are constants, and the first reference value and the second reference value are both greater than zero and less than 1. The fusion unit 420 is configured to use the fusion weight as a first weight corresponding to the first mask image, use a difference between a preset total weight and the fusion weight as a second weight corresponding to the third mask image, and perform weighted fusion on the first mask image and the third mask image according to the first weight and the second weight.
In an embodiment, the weight calculating unit 410 is specifically configured to perform an exponential operation by using a natural constant e as a base number and using a product of the optical flow information of the current frame video image and the preset first parameter as an exponent to obtain a third parameter;
and calculating to obtain the first reference value according to the third parameter and the preset second parameter.
Optionally, the fusion weight W(i, j) = min(r1, 1 - (1/e^(r2 × F(i, j))) × r3);
where r1 represents the second reference value, r2 represents the first parameter, r3 represents the second parameter, 1 - (1/e^(r2 × F(i, j))) × r3 represents the first reference value, and F(i, j) represents the optical flow information.
In one embodiment, the second reference value has a value range of [0.8, 0.95], the first parameter has a value range of [4, 6], and the second parameter has a value range of [0.6, 0.9].
For specific limitations of the video image processing apparatus, reference may be made to the above limitations of the video image processing method, which are not described herein again. The respective modules in the video image processing apparatus described above may be wholly or partially implemented by software, hardware, and a combination thereof. The modules can be embedded in a hardware form or independent of a processor in the electronic device, or can be stored in a memory in the electronic device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, an electronic device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 8. The electronic device comprises a processor, a memory, a network interface, a display screen, and an input device that are connected through a system bus. The processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The network interface of the electronic device is used for connecting to and communicating with an external terminal through a network. The computer program, when executed by a processor, implements the video image processing method. The display screen of the electronic device may be a liquid crystal display screen or an electronic ink display screen, and the input device of the electronic device may be a touch layer covering the display screen, a key, a trackball, or a touch pad arranged on the housing of the electronic device, or an external keyboard, touch pad, or mouse.
Those skilled in the art will appreciate that the structure shown in fig. 8 is a block diagram of only a portion of the structure relevant to the present disclosure and does not limit the electronic device to which the present disclosure may be applied; a particular electronic device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
Specifically, the electronic device may include a memory and a processor, the memory stores a computer program, and the processor implements the following steps when executing the computer program:
acquiring a current frame video image;
performing image segmentation on the current frame video image to obtain a first mask image corresponding to the current frame video image;
determining historical motion information of the current frame video image according to the current frame video image and a previous frame video image of the current frame video image, and obtaining a third mask image corresponding to the current frame video image according to a second mask image corresponding to the previous frame video image and the historical motion information of the current frame video image;
and calculating to obtain a fusion weight according to the historical motion information of the current frame video image, and fusing the first mask image and the third mask image according to the fusion weight to obtain a fusion mask image of the current frame video image.
In one embodiment, when the processor executes the step of determining the historical motion information of the current frame video image according to the current frame video image and the previous frame video image of the current frame video image, the following steps are specifically executed:
determining optical flow information of the current frame video image according to the current frame video image and a previous frame video image of the current frame video image, wherein the optical flow information is used for representing historical motion information of the current frame video image, and the optical flow information comprises a horizontal pixel offset and a vertical pixel offset of each pixel in the current frame video image.
In an embodiment, when the processor executes the step of obtaining the third mask image corresponding to the current frame video image according to the second mask image corresponding to the previous frame video image and the historical motion information of the current frame video image, the following steps are specifically executed:
respectively superposing the horizontal pixel offset of each pixel of the current frame video image and the horizontal pixel of the second mask image, and calculating to obtain the horizontal pixel of the third mask image;
and respectively superposing the vertical pixel offset of each pixel of the current frame video image and the vertical pixel of the second mask image, and calculating to obtain the vertical pixel of the third mask image.
In one embodiment, when the processor performs the step of obtaining the fusion weight according to the historical motion information of the current frame video image, the following steps are specifically performed:
calculating according to the optical flow information, a preset first parameter and a second parameter to obtain a first reference value;
and comparing the first reference value with a preset second reference value, and taking the minimum value of the first reference value and the second reference value as the fusion weight, wherein the first parameter and the second parameter are constants, and the first reference value and the second reference value are both greater than zero and less than 1.
In one embodiment, the step of calculating a first reference value according to the optical flow information of the current frame video image, a preset first parameter and a second parameter includes:
taking a natural constant e as the base and the product of the optical flow information of the current frame video image and the preset first parameter as the exponent, performing an exponential operation to obtain a third parameter;
and calculating to obtain the first reference value according to the third parameter and the preset second parameter.
In one embodiment, the fusion weight W(i, j) = min(r1, 1 - (1/e^(r2 × F(i, j))) × r3);
where r1 represents the second reference value, r2 represents the first parameter, r3 represents the second parameter, 1 - (1/e^(r2 × F(i, j))) × r3 represents the first reference value, and F(i, j) represents the optical flow information.
In one embodiment, the second reference value has a value range of [0.8, 0.95], the first parameter has a value range of [4, 6], and the second parameter has a value range of [0.6, 0.9].
In an embodiment, when the processor performs the step of performing weighted fusion on the first mask image and the third mask image according to the fusion weight to obtain the fusion mask image of the current frame video image, the following steps are specifically performed:
and taking the fusion weight as a first weight corresponding to the first mask image, taking the difference between a preset total weight and the fusion weight as a second weight corresponding to the third mask image, and performing weighted fusion on the first mask image and the third mask image according to the first weight and the second weight.
It should be clear that, in the embodiment of the present application, the process of implementing the video image segmentation and smoothing processing by the electronic device is consistent with the execution process of the video image processing method described above, and specific reference may be made to the description above.
Furthermore, an embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the following steps:
acquiring a current frame video image;
performing image segmentation on the current frame video image to obtain a first mask image corresponding to the current frame video image;
determining historical motion information of the current frame video image according to the current frame video image and a previous frame video image of the current frame video image, and obtaining a third mask image corresponding to the current frame video image according to a second mask image corresponding to the previous frame video image and the historical motion information of the current frame video image;
and calculating to obtain a fusion weight according to the historical motion information of the current frame video image, and fusing the first mask image and the third mask image according to the fusion weight to obtain a fusion mask image of the current frame video image.
In one embodiment, when the computer program is executed by the processor to implement the step of determining the historical motion information of the current frame video image according to the current frame video image and the previous frame video image of the current frame video image, the following steps are specifically implemented:
determining optical flow information of the current frame video image according to the current frame video image and a previous frame video image of the current frame video image, wherein the optical flow information is used for representing historical motion information of the current frame video image, and the optical flow information comprises a horizontal pixel offset and a vertical pixel offset of each pixel in the current frame video image.
In an embodiment, when the computer program is executed by the processor to implement the step of obtaining the third mask image corresponding to the current frame video image according to the second mask image corresponding to the previous frame video image and the historical motion information, the following steps are specifically implemented:
respectively superposing the horizontal pixel offset of each pixel of the current frame video image and the horizontal pixel of the second mask image, and calculating to obtain the horizontal pixel of the third mask image;
and respectively superposing the vertical pixel offset of each pixel of the current frame video image and the vertical pixel of the second mask image, and calculating to obtain the vertical pixel of the third mask image.
In one embodiment, when the computer program is executed by the processor to implement the step of obtaining the fusion weight according to the historical motion information, the following steps are specifically implemented:
calculating according to the optical flow information of the current frame video image, a preset first parameter and a second parameter to obtain a first reference value;
and comparing the first reference value with a preset second reference value, and taking the minimum value of the first reference value and the second reference value as the fusion weight, wherein the first parameter and the second parameter are constants, and the first reference value and the second reference value are both greater than zero and less than 1.
In one embodiment, when the computer program is executed by the processor to calculate and obtain the first reference value according to the optical flow information of the current frame video image, the preset first parameter and the second parameter, the following steps are specifically implemented:
taking a natural constant e as the base and the product of the optical flow information of the current frame video image and the preset first parameter as the exponent, performing an exponential operation to obtain a third parameter;
and calculating to obtain the first reference value according to the third parameter and the preset second parameter.
In one embodiment, the fusion weight W(i, j) = min(r1, 1 - (1/e^(r2 × F(i, j))) × r3);
where r1 represents the second reference value, r2 represents the first parameter, r3 represents the second parameter, 1 - (1/e^(r2 × F(i, j))) × r3 represents the first reference value, and F(i, j) represents the optical flow information.
In one embodiment, the second reference value has a value range of [0.8, 0.95], the first parameter has a value range of [4, 6], and the second parameter has a value range of [0.6, 0.9].
In an embodiment, when the computer program is executed by the processor to implement the step of performing weighted fusion on the first mask image and the third mask image according to the fusion weight to obtain the fusion mask image of the current frame video image, the following steps are specifically implemented:
and taking the fusion weight as a first weight corresponding to the first mask image, taking the difference between a preset total weight and the fusion weight as a second weight corresponding to the third mask image, and performing weighted fusion on the first mask image and the third mask image according to the first weight and the second weight.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
According to the video image processing method and apparatus described above, the current frame video image is segmented to obtain the first mask image; the third mask image corresponding to the current frame video image is obtained from the second mask image corresponding to the previous frame video image and the historical motion information of the current frame video image; and the fusion weight of each pixel in the current frame video image is calculated from the historical motion information, so that the first mask image and the third mask image are fused according to the fusion weight to obtain the fusion mask image of the current frame video image. By calculating the fusion weight from the historical motion information of the current frame video image and fusing the first mask image and the third mask image, the jitter and delay caused by inconsistency between preceding and following frames can be avoided, and the stability and smoothness of the video are improved.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered within the scope of this specification.
The above-mentioned embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the patent. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.