CN109685802B - Low-delay video segmentation real-time preview method - Google Patents


Info

Publication number
CN109685802B
Authority
CN
China
Prior art keywords
frames
image segmentation
video stream
values
segmentation result
Prior art date
Legal status
Active
Application number
CN201811527499.6A
Other languages
Chinese (zh)
Other versions
CN109685802A (en)
Inventor
巩晓雅
邬静云
刘国良
Current Assignee
Luzhou Hemiao Communication Technology Co ltd
Original Assignee
Luzhou Hemiao Communication Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Luzhou Hemiao Communication Technology Co ltd filed Critical Luzhou Hemiao Communication Technology Co ltd
Priority to CN201811527499.6A
Publication of CN109685802A
Application granted
Publication of CN109685802B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a low-delay video segmentation real-time preview method comprising the following steps: processing key frames of the video stream with a deep-learning-based network structure to obtain a second image segmentation result; processing the transition frames between key frames of the video stream with a gray projection algorithm to obtain the second image segmentation result of the transition frames; and displaying the first image segmentation result in real time using a low-delay display strategy. Processing the key frames with the deep learning network structure segments the images accurately; processing the transition frames with the gray projection algorithm exploits the similarity between video frames to rapidly propagate the previous frame's segmentation result, keeping each frame's segmentation time short and the video fluent; and combining the accurate key-frame results of the deep learning network with a low-delay strategy over the video sequence leaves the video free of stutter and lag.

Description

Low-delay video segmentation real-time preview method
Technical Field
The application relates to the technical field of video segmentation, in particular to a low-delay video segmentation real-time preview method.
Background
Image segmentation is an important component of computer vision with wide application in real life, such as tissue detection in medical images, disaster assessment, face beautification, and intelligent mapping. Video image segmentation refers to separating the foreground and background of an object in each frame of a video to obtain a binary image; because the smoothness of the video must be preserved, it places high demands on real-time performance. In recent years deep learning has developed rapidly and, in terms of precision, greatly surpasses traditional methods, so deep-learning-based image segmentation has gradually become a research hotspot. With the development of technology and the improved computing power of devices, many video segmentation applications are being deployed to mobile devices, especially smart phones. However, deep-learning-based methods are time-consuming and the computing power of mobile devices is low; how to apply deep-learning-based image segmentation to video on a mobile device while displaying each frame's segmentation result in real time is therefore a very challenging research topic.
Disclosure of Invention
Aiming at the problems that existing mobile devices have low computing power and that deep-learning-based image segmentation applied to video on such devices must still display each frame's segmentation result in real time, the application provides a low-delay video segmentation real-time preview method to solve the problems in this research topic.
According to a first aspect of the present application, there is provided a low-delay video segmentation real-time preview method, comprising the steps of:
processing key frames of the video stream on the device with a deep-learning-based network structure to obtain a second image segmentation result;
calculating translation vectors with a gray projection algorithm for the transition frames between key frames of the video stream on the device, to obtain the second image segmentation result of the transition frames;
displaying a first image segmentation result in real time using a low-delay display strategy, where the first image segmentation result refers to the content obtained by processing the second image segmentation result for real-time display on the screen;
the key frames and transition frames are determined according to the computing capability of the device itself.
Further, the steps of processing the key frames of the video stream on the device with the deep-learning-based network structure to obtain the second image segmentation result are as follows:
extracting an original image from a video stream;
performing convolution operation on the original image to obtain low-level features of the original image;
performing dense atrous convolution on the low-level features to obtain high-level features;
and decoding the low-level features and the high-level features to obtain corresponding second image segmentation results.
Further, the step of performing translation vector calculation on transition frames between key frames of the video stream on the device by using a gray projection algorithm to obtain a second image segmentation result of the transition frames includes:
performing gray mapping on the color images in the video stream using the G channel.
Further, the step of performing translation vector calculation on transition frames between key frames of the video stream on the device by using a gray projection algorithm to obtain a second image segmentation result of the transition frames includes:
searching all positions for the minimum Euclidean distance between the row/column gray projection curves of two frames using a binary-search-like method.
Further, searching all positions for the minimum Euclidean distance between the row/column gray projection curves of two frames using the binary-search-like method means:
step one, selecting three values among the N valid values and taking the smallest of the three as the first center point;
step two, centering on the first center point from step one, halving the search radius, selecting two values from the remaining values, comparing them with the first center point, and taking the smallest of the three as the second center point;
step three, centering on the second center point from step two, halving the search radius again, selecting two values from the remaining values, comparing them with the second center point, and taking the smallest of the three as the third center point;
and so on, until no more than three values remain; these are compared with the center point from the previous step, and the smallest is selected as the center point, which is the minimum sought.
Further, the use of the low-delay display strategy to display the first image segmentation result in real time means that:
the method comprises the steps that a strategy of displaying a look-ahead is used for a key frame, a process of waiting for the forward propagation of a key frame neural network is not suspended by a preview process, a rough result obtained through the propagation of a previous frame is used firstly, a complex operation process is transferred to a background to operate, an accurate result is obtained, and then the rough result is replaced with the accurate result in a time sequence; the preview process refers to real-time display content of a screen in the video stream transmission process; the process of the key frame neural network forward propagation refers to the process of processing key frames by using the network structure based on deep learning; the rough result is a second image segmentation result of the key frame, which is obtained by carrying out translation vector calculation on the previous frame of the key frame and the key frame by adopting a gray projection algorithm; the complex operation process refers to a process of forward propagation of a key frame neural network; the time sequence is a frame sequence formed by arranging each key frame and each transition frame according to the time sequence of the video stream when the video stream is transmitted; the accurate result refers to a second image segmentation result obtained after the key frame is processed by using the network structure based on the deep learning.
Compared with the prior art, the application has the following beneficial effects:
1. processing the key frames of the video stream on the device with the deep learning network structure segments the images accurately;
2. processing the transition frames between key frames of the video stream on the device with the gray projection algorithm exploits the similarity between video frames to rapidly propagate the previous frame's segmentation result, keeping each frame's segmentation time short and the video fluent;
3. combining the accurate results obtained by processing the key frames with the deep learning network structure with a low-delay strategy over the time sequence leaves the video free of stutter and lag.
Additional aspects and advantages of the application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flow chart of a low-latency video segmentation live preview method in an embodiment of the present application;
FIG. 2 is a flow chart of processing key frames of a video stream on a device to obtain a second image segmentation result using a deep learning based network architecture in an embodiment of the present application;
FIG. 3 is a schematic diagram of a low-latency display strategy for displaying a first image segmentation result in real time according to an embodiment of the present application;
FIG. 4 is a partial block diagram of a smart phone, for the case where the device is a smart phone, according to an embodiment of the present application.
Detailed Description
In order to enable those skilled in the art to better understand the present application, the following description will further explain the technical solutions in the embodiments of the present application by referring to the figures in the embodiments of the present application.
In some of the flows described in the specification, claims, and figures above, a plurality of operations appear in a particular order, but it should be understood that these operations may be executed out of that order or in parallel. Ordinal labels such as 11 and 12 merely distinguish operations and do not by themselves represent any execution order. In addition, the flows may include more or fewer operations, which may be executed sequentially or in parallel. Note that the terms "first" and "second" herein distinguish different messages, devices, modules, etc.; they do not represent a sequence, nor do they require that the "first" and "second" be of different types.
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.
It will be understood by those skilled in the art that all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs unless defined otherwise. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Example 1
As shown in fig. 1, a low-delay video segmentation real-time preview method according to an embodiment of the present application is provided, including the steps of:
the key frames and the transition frames in this embodiment are determined according to the computing capability of the device itself.
S1, processing key frames of the video stream on the device with a deep-learning-based network structure to obtain a second image segmentation result;
Since deep learning was proposed, deep networks have continuously developed in wider and deeper directions and, paired with massive data, their precision keeps improving, with remarkable results. The fully convolutional network (Fully Convolutional Networks, FCN) removed the fully connected layers of traditional neural networks and achieved pixel-level classification, a major breakthrough of deep learning in the field of image segmentation. On the basis of FCN, many researchers have proposed larger deep networks, such as UNet, SegNet, and the DeepLab series, with significant success. However, large deep networks rest on massive data and huge computational overhead; running them on a mobile device consumes a great deal of time and cannot meet the "faster and better" requirement of existing mobile applications. The processing of the video stream by the deep-learning-based network structure in this embodiment is therefore as follows:
As shown in FIG. 2, the steps for processing the key frames of the video stream on the device with the deep-learning-based network structure to obtain the second image segmentation result are:
s11, extracting an original image from a video stream;
s12, carrying out convolution operation on the original image to obtain low-level features of the original image;
s13, performing dense hole convolution operation on the low-level features to obtain high-level features;
the advantage of dense hole convolution is that when the image needs global information, the computation amount is not increased, and the receptive field is increased, so that each convolution output contains a larger range of information.
S14 decodes the low-level features and the high-level features to obtain corresponding second image segmentation results, which refer to the class of each pixel of the image, in particular, in the front-background segmentation, a binary image.
The decoding process is a process of performing deconvolution operation on the low-level features and the high-level features based on the deep-learning network structure to obtain corresponding second image segmentation results.
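For illustration only, a minimal PyTorch-style sketch of a network with this shape is given below: plain convolutions extract the low-level features (S12), a dense atrous block produces the high-level features (S13), and a decoder fuses both into per-pixel classes (S14). The class name FastSegNet, the channel widths, and the dilation rates are assumptions made for the sketch, not the patented network.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FastSegNet(nn.Module):
    # Hypothetical encoder/atrous/decoder sketch of steps S11-S14;
    # layer counts and widths are illustrative assumptions.
    def __init__(self, num_classes=2):
        super().__init__()
        # S12: plain convolutions extract low-level features
        self.low = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # S13: dense atrous (dilated) convolutions enlarge the receptive
        # field without increasing the per-output computation
        self.atrous = nn.ModuleList([
            nn.Conv2d(64, 64, 3, padding=d, dilation=d) for d in (1, 2, 4, 8)
        ])
        # S14: the decoder fuses low- and high-level features into classes
        self.decode = nn.Sequential(
            nn.Conv2d(64 + 64 * 4, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, num_classes, 1),
        )

    def forward(self, x):
        low = self.low(x)                                   # low-level features
        high = torch.cat([F.relu(c(low)) for c in self.atrous], dim=1)
        logits = self.decode(torch.cat([low, high], dim=1))
        # upsample to input resolution; argmax over classes then yields the
        # binary foreground/background image (the second result)
        return F.interpolate(logits, size=x.shape[2:], mode='bilinear',
                             align_corners=False)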
S2, calculating translation vectors with a gray projection algorithm for the transition frames between key frames of the video stream on the device, to obtain the second image segmentation result of the transition frames.
Analyzing a video frame by frame shows that, when the video runs smoothly, two adjacent frames are highly correlated; for mobile devices that do not demand very high accuracy, a pure translation is enough to express the motion between two frames. The gray projection algorithm computes the translation vector between two images, so it is adopted here to propagate the image segmentation result between adjacent frames.
The principle of the gray projection algorithm is briefly as follows: for each frame in the video sequence, map its gray values into two independent one-dimensional waveforms, yielding row and column projection curves for the frame; perform a correlation computation between the projection curves of two adjacent frames t and t+1; the translation vector between frames t and t+1 is found where the correlation is maximal. The correlation is computed with the Euclidean distance.
The Euclidean distance, also known as the Euclidean metric, is a commonly used distance definition: the true distance between two points in m-dimensional space. In two-dimensional space it is the length of the straight-line segment between two points; the smaller the Euclidean distance, the larger the correlation.
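As a concrete illustration of this principle, a minimal NumPy sketch follows: each frame is projected to row and column curves, and the shift minimizing the Euclidean distance between the overlapping parts of the curves of the two frames is taken as the translation. The function names, the search range max_shift, and the mean-based projection are assumptions made for the sketch; the two speed improvements described below (G-channel mapping and the binary-search-like search) can be substituted in.

import numpy as np

def gray_projection_shift(prev_gray, curr_gray, max_shift=16):
    # Estimate the (row, column) translation between two gray frames from
    # their row/column gray projection curves (Euclidean distance criterion).
    def best_shift_1d(ref, cur):
        n = len(ref)
        best_s, best_d = 0, np.inf
        for s in range(-max_shift, max_shift + 1):   # global search version
            a = ref[max(0, s):n + min(0, s)]
            b = cur[max(0, -s):n - max(0, s)]
            d = np.sqrt(np.sum((a - b) ** 2))        # Euclidean distance
            if d < best_d:
                best_d, best_s = d, s
        return best_s

    row_shift = best_shift_1d(prev_gray.mean(axis=1), curr_gray.mean(axis=1))
    col_shift = best_shift_1d(prev_gray.mean(axis=0), curr_gray.mean(axis=0))
    return row_shift, col_shift

The transition frame's mask can then be propagated roughly by shifting the previous frame's mask, e.g. np.roll(prev_mask, (row_shift, col_shift), axis=(0, 1)).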
In order to further improve the speed of the gray projection algorithm, the following two improvements are made:
first, gray mapping is carried out on color images in a video stream in a G channel; scientific researches show that the maximum photosensitivity of human eyes is located at 555nm, namely near green light, and color images in video streams can express enough row and column gray scale distribution of images in a G channel, and the time required by image graying is saved.
Second, a binary-search-like method is used to search all positions for the minimum Euclidean distance between the row/column gray projection curves of two frames. The gray projection algorithm normally uses a global search, which is computationally expensive; but the row/column correlation curve between two frames has a single-peak characteristic, so the minimum can instead be located with a binary-search-like method, greatly reducing the computation.
Searching all positions for the minimum Euclidean distance between the row/column gray projection curves of two frames with the binary-search-like method works as follows:
uniformly select three points in the valid search range, take the point with the minimum value as the center point, halve the search radius, and repeat the process until convergence; the point so obtained is the minimum. In other words:
step one, selecting three values among the N valid values and taking the smallest of the three as the first center point;
step two, centering on the first center point from step one, halving the search radius, selecting two values from the remaining values, comparing them with the first center point, and taking the smallest of the three as the second center point;
step three, centering on the second center point from step two, halving the search radius again, selecting two values from the remaining values, comparing them with the second center point, and taking the smallest of the three as the third center point;
and so on, until no more than three values remain, then selecting the smallest as the center point, which is the minimum sought.
For example: three values were chosen uniformly among the available 9 values, which were 98, 68, 49, 21, 15, 16, 19, 45, 88 in turn. Firstly, uniformly selecting three values, namely selecting two end point values and a middle value, namely selecting 98, 15 and 88, and taking the minimum value of the three points as 15; then shortening the searching radius to 1/2 of the previous step, continuously and uniformly selecting three numbers from the rest numerical values by taking 15 as the center, namely selecting 49, 15 and 19, and taking the minimum value of the three points as 15; finally, shortening the searching radius to 1/2 of the previous step, continuously and uniformly selecting three numbers from the remaining numbers, wherein only 3 numbers, namely 21, 15 and 16, are left in the remaining numbers, and the minimum value in the three points is 15; the minimum value obtained by the search is 15.
S3, displaying a first image segmentation result in real time using a low-delay display strategy, where the first image segmentation result refers to the content obtained by processing the second image segmentation result for real-time display on the screen.
Processing the second image segmentation result into real-time screen content means applying video special-effect editing to the second image segmentation result, such as road highlighting or background blurring; the particular special effect is configured in advance according to the application scene.
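As one illustration of turning a second result into a first result, a hedged OpenCV sketch of background blurring driven by the binary mask follows; the function name and kernel size are assumptions, and road highlighting would replace the blur step with a color overlay.

import cv2
import numpy as np

def render_background_blur(frame_bgr, mask):
    # first result = displayed content: foreground kept sharp, background blurred
    blurred = cv2.GaussianBlur(frame_bgr, (21, 21), 0)
    fg = (mask > 0)[..., None]          # HxWx1 boolean foreground mask
    return np.where(fg, frame_bgr, blurred)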
In many real-time applications, low latency is very important. Propagating image segmentation results between frames with the fast gray projection algorithm exploits the similarity between video frames to propagate the previous frame's result quickly, so each frame's segmentation time is short; key frames, however, still need a complex neural network computation to obtain their segmentation results, which causes stutter and lag.
To solve this problem, the present embodiment displays the first image segmentation result in real time with a low-delay display strategy, as follows: a display-ahead strategy is used for key frames; the preview process does not pause to wait for the forward propagation of the key-frame neural network; a rough result obtained by propagation from the previous frame is displayed first while the complex computation is moved to the background; once the accurate result is obtained, the rough result is replaced with the accurate result in the time sequence. The preview process refers to the real-time display content of the screen during video stream transmission; the forward propagation of the key-frame neural network refers to processing the key frame with the deep-learning-based network structure; the rough result is a second image segmentation result of the key frame obtained by calculating, with the gray projection algorithm, the translation vector between the key frame and its previous frame; the complex computation refers to the forward propagation of the key-frame neural network; the time sequence is the frame sequence formed by arranging the key frames and transition frames in the temporal order of the video stream; and the accurate result is the second image segmentation result obtained after processing the key frame with the deep-learning-based network structure.
The above process can also be expressed as follows: the low-delay display strategy includes the forward propagation of the key-frame neural network, but during video display this forward propagation is first moved to a background computation that produces the accurate second image segmentation result; meanwhile, the translation vector between the key frame and its previous frame is computed to obtain a rough second image segmentation result (the rough result) of the key frame, which is processed into a rough first image segmentation result and displayed. When the background computation completes, the rough second image segmentation result is replaced in the time sequence by the accurate second image segmentation result, as shown in FIG. 3, where the mask is the second image segmentation result produced by the segmentation network.
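A minimal threading sketch of this display-ahead strategy follows; segment_net and propagate stand for the deep network and the gray-projection propagation described above, and all names are assumptions made for the sketch rather than the patented implementation.

import threading

class LowDelayPreview:
    # Key frames are first shown with a rough mask propagated from the
    # previous frame; the neural network runs in a background thread and
    # its accurate mask replaces the rough one in the time sequence.
    def __init__(self, segment_net, propagate):
        self.segment_net = segment_net      # slow, accurate key-frame model
        self.propagate = propagate          # fast gray-projection propagation
        self.masks = {}                     # frame index -> latest mask
        self.lock = threading.Lock()

    def _refine_key_frame(self, idx, frame):
        accurate = self.segment_net(frame)  # complex computation, in background
        with self.lock:
            self.masks[idx] = accurate      # replace rough with accurate

    def on_frame(self, idx, frame, prev_frame, is_key):
        # every frame gets an immediate rough mask, so the preview never
        # stalls waiting for the network's forward propagation
        with self.lock:
            prev_mask = self.masks.get(idx - 1)
        rough = self.propagate(prev_frame, frame, prev_mask)
        with self.lock:
            self.masks[idx] = rough
        if is_key:
            threading.Thread(target=self._refine_key_frame,
                             args=(idx, frame), daemon=True).start()
        return rough                        # displayed immediately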
The device in this embodiment may be a smart phone, a tablet PC, a PDA (Personal Digital Assistant), or a similar terminal; taking the smart phone as an example:
referring to fig. 4, a block diagram of a part of the structure of a smart phone includes a processor 401, a memory 402, an operating system 403, a bluetooth module 404, a display module 405, an audio processing module 406, a video processing module 407, a sensor module 408, a communication module 409, a wireless network module 410, a power module 411, a key module 412, an interface module 413, an input/output module 414, an RF circuit module 415, and a positioning module 416.
The processor 401 is a control center of the mobile phone, connects each module of the whole mobile phone through an interface and a line, and performs data processing by running or executing a built-in operating system 403, a software program and/or a module stored in the memory 402 and calling data stored in the memory 402, thereby performing various corresponding functions, and thus performing overall control of the mobile phone. Optionally, the processor 401 may include one or more processing units; preferably, the processor 401 may integrate an application processor and a modem processor, wherein the application processor is mainly applied in terms of an operating system 403, a user interface, and application programs, etc., and the modem processor is mainly applied in terms of wireless communication.
The memory 402 mainly includes a storage program area that can store an operating system 403 and application programs (such as a sound playing function, an image/video playing function, etc.) required for at least one function, and a storage data area; the storage data area may store data (such as audio data, phonebook, etc.) created according to the use of the handset, etc. In addition, memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
The operating system 403 is the kernel and the keystone of the system, and is also a program for managing system hardware and system software resources, such as managing and configuring memory, determining the priority of supply and demand of system resources, controlling input and output devices, connecting networks, managing file systems, and the like. For example, operating system 403 provides an operator interface for a user to interact with the system. The classic operating systems include an Android operating system and an iOS operating system.
The bluetooth module 404 is a PCBA board with integrated bluetooth function, specifically, a chip basic circuit set with integrated bluetooth function is used for short-distance wireless communication, and is divided into a bluetooth data module and a bluetooth voice module according to the functions.
The display module 405 includes a display 4051, typically a liquid crystal display, used to show text, pictures, animations, and video. The display 4051 has a touch function: when a touch operation on or near it is detected, it is passed to the processor 401 to determine the type of touch event, and the processor 401 then provides a corresponding visual output on the display 4051 based on that type.
The audio processing module 406 includes a microphone 4061 and an audio processor 4062; generally, after collecting the sound signals, the microphone 4061 converts the collected sound signals into electrical signals, and the audio processor 4062 receives the electrical signals and converts the electrical signals into audio data; when audio data needs to be played, the audio processor 4062 converts the received audio data into an electrical signal, and then transmits the electrical signal to the microphone 4061, and the electrical signal is converted into a sound signal by the microphone 4061 and output.
The video processing module 407 includes a camera 4071 and a graphics processor 4072; the camera 4071 captures images and videos, and the graphics processor 4072 processes the stored images or videos, for example for noise cancellation, distortion correction, sharpness enhancement, and the background blurring mentioned in the present application.
The sensor module 408 includes a variety of sensors, such as light sensors, motion sensors, gyroscopes, barometers, hygrometers, thermometers, and infrared sensors. Specifically, the light sensors may include an ambient light sensor and a proximity sensor: the ambient light sensor adjusts the brightness of the display according to the ambient light, and the proximity sensor turns off the display and/or the backlight when the phone is moved to the ear. The acceleration sensor, one of the motion sensors, detects the magnitude of acceleration in all directions (generally three axes) and, when stationary, the magnitude and direction of gravity; it can be used in applications that recognize the phone's attitude, such as landscape/portrait switching and magnetometer gesture calibration.
The communication module 409 is used for processing and transmitting all message types such as information and voice, for example, receiving and sending information, making a call, and making a voice call through communication software.
A wireless network module 410 comprising a WiFi unit; the user accesses the internet, for example, e-mails, browses web pages, accesses streaming media, etc., through the wireless network module 410 of the cellular phone.
The power module 411 includes a battery 4111 and a power management system 4112, wherein the power management system 4112 is logically connected to the processor 401 to perform functions such as charging, discharging, and power consumption management of the battery 4111.
The key module 412 includes at least a power key and a volume up-down key; the power button controls the state of the power module 411 of the mobile phone; the volume up-down key is generally used for adjusting the volume of mobile phone audio/video and other media, and can also be used for adjusting the brightness and the darkness of the mobile phone; furthermore, the combination of the power key and the volume increasing and decreasing key can also be used for screen capturing, switching on/off, restarting, system restoration and the like of the mobile phone.
The interface module 413 includes a card connection unit, an earphone interface, a data interface, and/or a power interface. The card connection unit accepts a data card and a SIM card; the data card expands the phone's storage space, and once a SIM card is inserted the user can call and contact users of other terminals holding SIM cards, or connect to the network through the SIM card's data plan. The earphone, data, and power interfaces take different forms on different phones: on some they are integrated, on others independent, and on others partly combined and partly independent.
The input/output module 414 is configured to receive input digital or character information, and obtain information input by a user through the operating system 403 at an interface of the display screen.
The RF circuit module 415 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like. In addition, the RF circuit module 415 may also communicate with other devices through the wireless network module 410.
The positioning module 416 is used for positioning the current geographic position of the mobile phone to realize navigation or location-based services; the positioning module 416 is generally based on the location information of the GPS system (Global Positioning System) in the united states or the beidou system in china to position the mobile phone.
Those skilled in the art will appreciate that the structure shown in fig. 4 is not meant to be limiting, and may include more or less components than shown, or may combine certain components, or may employ a different arrangement of components, which is not described in detail herein.
In the embodiments provided by the present application, it should be understood that the described method may be implemented in other ways. For example, the method embodiments described above are merely illustrative: the division of the method is merely a logical function division, other divisions are possible in actual implementation, and some or all units may be selected according to actual needs to achieve the purpose of the embodiment.
The foregoing is only a partial embodiment of the present application, and it should be noted that it will be apparent to those skilled in the art that modifications and adaptations can be made without departing from the principles of the present application, and such modifications and adaptations are intended to be comprehended within the scope of the present application.

Claims (5)

1. A low-delay video segmentation real-time preview method, comprising the steps of:
processing key frames of the video stream on the device with a deep-learning-based network structure to obtain a second image segmentation result;
calculating translation vectors with a gray projection algorithm for the transition frames between key frames of the video stream on the device, to obtain the second image segmentation result of the transition frames;
displaying a first image segmentation result in real time using a low-delay display strategy, where the first image segmentation result refers to the content obtained by processing the second image segmentation result for real-time display on the screen;
the key frames and transition frames are determined according to the computing capability of the device;
the adoption of the low-delay display strategy to display the first image segmentation result in real time means that:
the method comprises the steps that a strategy of displaying a look-ahead is used for a key frame, a process of waiting for the forward propagation of a key frame neural network is not suspended by a preview process, a rough result obtained through the propagation of a previous frame is used firstly, a complex operation process is transferred to a background to operate, an accurate result is obtained, and then the rough result is replaced with the accurate result in a time sequence; the preview process refers to real-time display content of a screen in the video stream transmission process; the process of the key frame neural network forward propagation refers to the process of processing key frames by using the network structure based on deep learning; the rough result is a second image segmentation result of the key frame, which is obtained by carrying out translation vector calculation on the previous frame of the key frame and the key frame by adopting a gray projection algorithm; the complex operation process refers to a process of forward propagation of a key frame neural network; the time sequence is a frame sequence formed by arranging each key frame and each transition frame according to the time sequence of the video stream when the video stream is transmitted; the accurate result refers to a second image segmentation result obtained after the key frame is processed by using the network structure based on the deep learning.
2. The method of claim 1, wherein the step of processing key frames of the video stream on the device by the deep learning based network structure to obtain the second image segmentation result is as follows:
extracting an original image from a video stream;
performing convolution operation on the original image to obtain low-level features of the original image;
performing dense atrous convolution on the low-level features to obtain high-level features;
and decoding the low-level features and the high-level features to obtain corresponding second image segmentation results.
3. The method of claim 1, wherein the step of performing translation vector calculation on transition frames between key frames of the video stream on the device using the gray projection algorithm to obtain the second image segmentation result of the transition frames comprises:
performing gray mapping on the color images in the video stream using the G channel.
4. A method according to claim 3, wherein the step of performing translation vector calculation on transition frames between key frames of the video stream on the device using a gray projection algorithm to obtain a second image segmentation result of the transition frames comprises:
searching all positions for the minimum Euclidean distance between the row/column gray projection curves of two frames using a binary-search-like method.
5. The method of claim 4, wherein searching all positions for the minimum Euclidean distance between the row/column gray projection curves of two frames using the binary-search-like method means:
step one, selecting three values among the N valid values and taking the smallest of the three as the first center point;
step two, centering on the first center point from step one, halving the search radius, selecting two values from the remaining values, comparing them with the first center point, and taking the smallest of the three as the second center point;
step three, centering on the second center point from step two, halving the search radius again, selecting two values from the remaining values, comparing them with the second center point, and taking the smallest of the three as the third center point;
and so on, until no more than three values remain, then selecting the smallest as the center point, which is the minimum sought.
CN201811527499.6A 2018-12-13 2018-12-13 Low-delay video segmentation real-time preview method Active CN109685802B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811527499.6A CN109685802B (en) 2018-12-13 2018-12-13 Low-delay video segmentation real-time preview method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811527499.6A CN109685802B (en) 2018-12-13 2018-12-13 Low-delay video segmentation real-time preview method

Publications (2)

Publication Number Publication Date
CN109685802A CN109685802A (en) 2019-04-26
CN109685802B true CN109685802B (en) 2023-09-15

Family

ID=66186592

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811527499.6A Active CN109685802B (en) 2018-12-13 2018-12-13 Low-delay video segmentation real-time preview method

Country Status (1)

Country Link
CN (1) CN109685802B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110322445B (en) * 2019-06-12 2021-06-22 浙江大学 Semantic segmentation method based on maximum prediction and inter-label correlation loss function
CN110490858B (en) * 2019-08-21 2022-12-13 西安工程大学 Fabric defective pixel level classification method based on deep learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101287143A (en) * 2008-05-16 2008-10-15 清华大学 Method for converting flat video to tridimensional video based on real-time dialog between human and machine
CN103533255A (en) * 2013-10-28 2014-01-22 东南大学 Motion displacement curve simplification based automatic segmentation method for video scenes
CN108062753A (en) * 2017-12-29 2018-05-22 重庆理工大学 The adaptive brain tumor semantic segmentation method in unsupervised domain based on depth confrontation study
CN108198202A (en) * 2018-01-23 2018-06-22 北京易智能科技有限公司 A kind of video content detection method based on light stream and neural network
CN108242062A (en) * 2017-12-27 2018-07-03 北京纵目安驰智能科技有限公司 Method for tracking target, system, terminal and medium based on depth characteristic stream

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140210944A1 (en) * 2013-01-30 2014-07-31 Samsung Electronics Co., Ltd. Method and apparatus for converting 2d video to 3d video
US10303984B2 (en) * 2016-05-17 2019-05-28 Intel Corporation Visual search and retrieval using semantic information
EP3500911B1 (en) * 2016-08-22 2023-09-27 Magic Leap, Inc. Augmented reality display device with deep learning sensors

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101287143A (en) * 2008-05-16 2008-10-15 清华大学 Method for converting flat video to tridimensional video based on real-time dialog between human and machine
CN103533255A (en) * 2013-10-28 2014-01-22 东南大学 Motion displacement curve simplification based automatic segmentation method for video scenes
CN108242062A (en) * 2017-12-27 2018-07-03 北京纵目安驰智能科技有限公司 Method for tracking target, system, terminal and medium based on depth characteristic stream
CN108062753A (en) * 2017-12-29 2018-05-22 重庆理工大学 The adaptive brain tumor semantic segmentation method in unsupervised domain based on depth confrontation study
CN108198202A (en) * 2018-01-23 2018-06-22 北京易智能科技有限公司 A kind of video content detection method based on light stream and neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on segmentation algorithms for moving objects under unspecified background conditions; Wang Jianping et al.; Computer Applications and Software; 2010-03-31; Vol. 27, No. 3; pp. 256-259 *
Object segmentation combining gray projection with the Fisher criterion; Hu Zhengping; Computer Engineering and Design; 2005-09-30; Vol. 26, No. 9; pp. 2439-2442 *

Also Published As

Publication number Publication date
CN109685802A (en) 2019-04-26


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230816

Address after: 646000 Buildings 13 and 15, No. 1, Section 6, Jiugu Avenue, Jiangyang District, Luzhou, Sichuan

Applicant after: Luzhou hemiao Communication Technology Co.,Ltd.

Address before: 563000 B Building 2, Zunyi Software Park, economic development zone, Xinpu New District, Zunyi, Guizhou.

Applicant before: GUIZHOU MARS EXPLORATION TECHNOLOGY Co.,Ltd.

GR01 Patent grant