CN112215762A - Video image processing method and device and electronic equipment


Info

Publication number: CN112215762A
Application number: CN201910631228.3A
Authority: CN (China)
Prior art keywords: target object, image, video frame, video, mask
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 宋子奇, 李晓波
Current assignee: Alibaba Group Holding Ltd
Original assignee: Alibaba Group Holding Ltd
Application filed by Alibaba Group Holding Ltd; priority to CN201910631228.3A

Classifications

    • G06T5/77
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048 - Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0484 - Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • G06F3/04845 - Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range for image manipulation, e.g. dragging, rotation, expansion or change of colour
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 - Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21 - Server components or server architectures
    • H04N21/218 - Source of audio or video content, e.g. local disk arrays
    • H04N21/2187 - Live feed
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10016 - Video; Image sequence
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20092 - Interactive image processing based on input by user
    • G06T2207/20104 - Interactive definition of region of interest [ROI]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/30 - Subject of image; Context of image processing
    • G06T2207/30196 - Human being; Person

Abstract

The embodiment of the invention provides a video image processing method, a video image processing device and electronic equipment, wherein the method comprises the following steps: detecting a target object in a video frame; generating an image mask of the target object; and blurring the background image of the target object according to the image mask to generate and play a new video frame. The embodiment of the invention uses mask-based image fusion and image tracking to continuously update the image feature data of the target object, so that special-effect processing can be applied quickly within a video stream, which improves the processing efficiency of the video focusing effect.

Description

Video image processing method and device and electronic equipment
Technical Field
The application relates to a video image processing method and device and electronic equipment, and belongs to the technical field of computers.
Background
In video playback scenarios, there is a need to blur the background around a particular object to produce a focus-like effect, for example on a specific person. Taking focusing on a person as an example, several people including the specific person may appear in the same shot. In the conventional approach, the specific person is distinguished from the others by manual selection during video post-production and an outline is extracted, so that the background can then be blurred.
However, this approach consumes a large amount of human resources, is extremely inefficient, and cannot be applied to live video.
Disclosure of Invention
The embodiment of the invention provides a video image processing method, a video image processing device and electronic equipment, which can realize special effect processing in a video stream in real time and quickly.
In order to achieve the above object, an embodiment of the present invention provides a video image processing method, including:
detecting a target object in a video frame;
generating an image mask of the target object;
and performing blurring processing on the background image of the target object according to the image mask to generate a new video frame and play the new video frame.
An embodiment of the present invention further provides a video image processing apparatus, including:
a target object detection module: for detecting a target object in a video frame;
a mask generation module: an image mask for generating the target object;
a blurring processing module: used for blurring the background image of the target object according to the image mask to generate a new video frame and playing the new video frame.
The embodiment of the invention also provides a video image processing method, which comprises the following steps:
in a current video frame of the live video, responding to a selection operation of a user, and determining a selected object as a target object;
detecting the target object in a subsequent video frame;
generating an image mask of the target object;
and performing blurring processing on the background image of the target object according to the image mask to generate a new live broadcast video and play the live broadcast video.
The embodiment of the invention also provides a video image processing method, which comprises the following steps:
in a current video frame of the live video, responding to a selection operation of a user, and determining a selected object as a target object;
detecting the target object in a subsequent video frame;
generating an image mask of the target object;
and processing the target object and/or the background image except the target object according to the image mask, distinguishing the target object from the background image except the target object, and generating and playing a new live video.
An embodiment of the present invention further provides an electronic device, including:
a memory for storing a program;
and the processor is used for operating the program stored in the memory so as to execute the video image processing method.
The embodiment of the invention uses mask-based image fusion and image tracking to continuously update the image feature data of the target object, so that special-effect processing can be applied quickly within a video stream. This improves the processing efficiency of the video focusing effect, allows the user to specify the target object, and thus better meets user requirements.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Fig. 1 is a schematic structural diagram of a mask generation model of a video image processing method according to an embodiment of the present invention;
Figs. 2a to 2c are schematic diagrams one to three of application scenarios of the video image processing method according to the embodiment of the invention;
Figs. 3a to 3c are schematic diagrams four to six of application scenarios of the video image processing method according to the embodiment of the invention;
Figs. 4 to 5 are schematic diagrams seven to eight of application scenarios of the video image processing method according to the embodiment of the invention;
Fig. 6 is a flowchart illustrating a video image processing method according to an embodiment of the invention;
Fig. 7 is a block diagram of a video image processing apparatus according to an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In a video playing scene, the background of a specific object needs to be blurred to generate a focus-like effect. A manual selection approach consumes a large amount of human resources, is extremely inefficient, and cannot be applied to a live video scene.
The embodiment of the invention uses mask-based image fusion and image tracking to continuously update the image feature data of the target object, so that special-effect processing can be applied to a video stream quickly and in real time.
The video image processing method of the embodiment of the invention can be realized by the following processing:
(1) selecting a target object and extracting image characteristics of the target object:
in the process of live video broadcast, the background of a specific object needs to be blurred to generate a focus-like effect on that object; for convenience of description, the specific object is referred to as the target object. The target object may be determined by letting the user select one or more objects from the displayed video frame, for example with a mouse or a touch screen. Specifically, the coordinates of the position selected by the user in the video frame are acquired, the bounding box to which the position coordinates belong is identified, and the object in that bounding box is taken as the target object. The bounding box referred to herein is the maximum bounding rectangle of an object in the image, i.e., for the selected object, the maximum bounding rectangle of the target object.
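As a concrete illustration, the hit-test just described can be sketched as follows; the (x1, y1, x2, y2) box format and the function name are assumptions for illustration, not taken from the patent:

```python
# Hit-test sketch: map a user's click to the bounding box that contains it.
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2): an object's maximum bounding rectangle

def find_selected_box(click_x: int, click_y: int, boxes: List[Box]) -> Optional[Box]:
    """Return the bounding box that encloses the clicked position, if any."""
    for (x1, y1, x2, y2) in boxes:
        if x1 <= click_x <= x2 and y1 <= click_y <= y2:
            return (x1, y1, x2, y2)
    return None  # the click fell on the background
```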
In addition, a certain object or a certain type of object in the video image can be automatically identified by the system to select the target object. For example, the type of the target object (e.g., a person or a specific object) is set in advance, and then the target object is automatically selected by a technique such as image recognition.
After the target object is determined, image feature data of the target object (hereinafter referred to as the first image feature) is extracted from the current video frame and used for tracking and comparison in subsequent video frames. The extraction of image feature data for the target object may be realized by, for example, a CNN (convolutional neural network).
(2) Tracking and identifying the target object and updating image characteristics:
in a video stream, the feature data of the target object may change between video frames. Therefore, while performing feature processing on the video frames, after the target object is identified in a subsequent video frame based on its first image feature, the image feature is extracted again and taken as the current first image feature of the target object. That is, for each subsequent video frame the first image feature is updated to the latest feature of the target object, which realizes tracking of the target object.
Specifically, for each subsequent video frame, the image feature data of all objects in the frame may be extracted; for convenience of distinction, these are referred to here as second image feature data. The second image feature data are compared with the first image feature data one by one, the similarity between each object's image feature data and the first image feature data is calculated, and if an object whose similarity is greater than a preset threshold exists, that object is determined to be the target object. The image feature data of that object is then used as the first image feature data for feature comparison in subsequent video frames.
(3) Generating a mask:
a mask is a binary image used to identify a local region of an image, and it has the same size as the image: pixels inside the identified region have the value 255 in the mask, and pixels outside the identified region have the value 0. In the present invention, the mask serves to identify the outline of the target object, which enables the target object to be processed differently from the rest of the image. In the mask of the target object, therefore, the pixel value of the object part is 255 and the pixel values of all other parts are 0.
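In code, such a mask is simply a single-channel image of the same size as the frame; the polygon-fill route via OpenCV below is an illustrative assumption:

```python
# A mask as described above: a uint8 image the same size as the frame,
# 255 inside the target object's outline and 0 everywhere else.
import cv2
import numpy as np

frame = np.zeros((480, 640, 3), dtype=np.uint8)   # placeholder video frame
mask = np.zeros(frame.shape[:2], dtype=np.uint8)  # same height/width as the frame
outline = np.array([[100, 100], [300, 100], [300, 400], [100, 400]], dtype=np.int32)
cv2.fillPoly(mask, [outline], 255)                # target region -> 255, rest stays 0
```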
Fig. 1 is a schematic structural diagram of the mask generation model of the video image processing method according to the embodiment of the present invention. A video frame image first passes through an image feature extraction layer 31 to obtain feature data 41 of the whole image; for example, a CNN (convolutional neural network) can be used as the image feature extraction layer, generating a feature map of the whole image that serves as the feature data 41.
The extracted feature data 41 is identified by the ROI (region of interest) recommendation layer 32 to find out a region of interest (ROI), that is, a region where the target object may exist. Then, feature data of the region of interest is extracted from the feature data of the entire image. As shown in the figure, after the ROI recommendation layer identifies three regions of interest (three regions marked by box 42), image feature data 43, 44, 45 corresponding to the three regions of interest are extracted from the feature data of the whole image.
Since the sizes of the regions of interest may be different, in order to facilitate the processing of the subsequent layer, the feature data of the candidate target region may pass through the ROI alignment layer 33, and the feature data of the region of interest is aligned to obtain aligned image feature data 46, 47, 48.
The feature data of the aligned regions of interest is provided to the classification layer 34 for classification and identification. The output of the classification layer is whether the region of interest contains an object of the same category as the target object. For example, suppose the desired special-effect video contains several persons and only one specific person should be in focus. The target object is then that specific person, and the category of the target object is "person": the classification layer 34 outputs whether the object contained in a region of interest is a person, while whether it is the specific person is identified by the feature comparison processing outside the model. When the classification result contains an object of the same category as the target object, the bounding box calculation layer 35 calculates the position coordinates of the bounding box of that object and outputs them to the mask calculation layer 36.
The mask calculation layer 36 performs mask calculation processing based on the output results of the classification layer 34 and the bounding box calculation layer 35, and outputs a mask. Specifically, if the classification result is that an object of the same category as the target object is included, the mask of the object is output through calculation by the mask calculation layer 36, and if the classification result is that an object of the same category as the target object is not included, the mask may not be output or an invalid mask may be output.
In addition, the output image masks are not necessarily all used. In the embodiment of the present invention, if, for example, the classification layer in the mask generation model is trained to recognize persons, a mask is output for every recognized person (one per region of interest); if the process flow is to implement a special effect based on one specific person selected by the user, only the mask of that specific person (i.e., the target object) is used.
Of course, in a more simplified mask generation model, the classification layer 34 may only distinguish objects from the background; that is, the classification layer merely determines whether a region of interest contains an object, so that a plurality of candidate masks are generated.
After the mask generation model generates a plurality of candidate masks for a certain subsequent video frame, the mask of the target object is selected based on the result of the above (2) tracking recognition processing of the target object. That is, by the above-mentioned feature matching process of the target object, the target object in the subsequent video frame is found, and information such as the position or the bounding box of the target object is determined, and based on the information, a corresponding mask can be found from the generated plurality of candidate masks, and used for the subsequent image fusion process.
It should be noted that the mask generation model shown in fig. 1 mainly functions to classify and identify objects in a video frame and generate masks, but intermediate data generated by the model during image processing, such as image feature data of the whole image, feature data of a region of interest, and position coordinates of a bounding box of an object in the image, may also be used in other parts of the processing in the embodiment of the present invention. For example, in view of the fact that the mask generation model has generated a plurality of regions of interest and feature extraction is performed on the images in these regions of interest, the image feature data of these regions of interest may be used as the aforementioned second image feature data. Of course, after the classification layer identifies the regions of interest, it is also possible to screen out which regions of interest contain objects of the same category as the target object, and use the feature data of the corresponding region of interest as the second image feature data.
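The layer stack described above (a CNN backbone, an ROI recommendation layer, ROI alignment, and classification, bounding box, and mask heads) closely matches Mask R-CNN. The patent does not name that model, so the sketch below, which uses torchvision's pretrained implementation to obtain candidate person masks, is an assumption for illustration:

```python
# Candidate-mask generation sketch built on torchvision's Mask R-CNN.
import torch
import torchvision

# weights="DEFAULT" requires torchvision >= 0.13; older versions use pretrained=True.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

def candidate_masks(frame_rgb, score_thresh=0.5, person_label=1):
    """Return 0/255 uint8 candidate masks for detected persons in one frame.

    frame_rgb: HxWx3 uint8 RGB image; person_label=1 is the COCO "person" class.
    """
    x = torch.from_numpy(frame_rgb).permute(2, 0, 1).float() / 255.0
    with torch.no_grad():
        out = model([x])[0]
    keep = (out["scores"] > score_thresh) & (out["labels"] == person_label)
    # The model outputs soft masks in [0, 1]; binarize to the 0/255 convention
    # used by the mask calculation layer described above.
    return [(m[0] > 0.5).numpy().astype("uint8") * 255
            for m in out["masks"][keep]]
```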
(4) Image fusion:
in the process of image fusion, the whole original video frame image is first blurred; specifically, a Gaussian blur algorithm can be used, producing a blurred video frame image.
Then, the mask of the target object output by the mask generation model is multiplied by the original video frame image to generate a mask image (for convenience of description, the image obtained by multiplying the mask by the original image is referred to as the mask image). In the mask image, the pixels of the target object retain the pixel values of the original video frame, while the pixel values of the parts other than the target object are 0. The mask can be compared to a mold in a real scene: through the mask, the prototype of the target object (the object's original pixel values) is grabbed from the original video frame image.
Finally, the mask image and the blurred image are fused: the original pixel values are retained for the target object part, and the blurred pixel values are retained for all parts other than the target object.
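A minimal fusion sketch follows, assuming OpenCV and a 31x31 Gaussian kernel; the patent specifies Gaussian blur but not the library or kernel size:

```python
# Fusion sketch for steps (3)-(4): blur the whole frame, then composite so the
# target object keeps its original pixels and everything else is blurred.
import cv2
import numpy as np

def focus_effect(frame: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """frame: HxWx3 uint8 image; mask: HxW uint8 with 255 on the target object."""
    blurred = cv2.GaussianBlur(frame, (31, 31), 0)
    m = (mask > 0)[..., None]           # HxWx1 boolean, broadcast over channels
    return np.where(m, frame, blurred)  # target sharp, background blurred
```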
Therefore, through this image fusion, a new special-effect video frame with a focus-like effect is finally generated. In an actual application scenario, during live video broadcast, after an object is selected in some video frame, step (1) above is executed to process the current video frame, and then steps (2) to (4) are executed in a loop over the subsequent frames, so that the played video has a focusing effect.
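Tying the steps together, the per-frame loop might look like the sketch below, reusing candidate_masks and focus_effect from the sketches above. Reading the stream with cv2.VideoCapture, the stream URL, and picking the first candidate mask in place of the full tracking step are all illustrative assumptions:

```python
# Per-frame processing loop sketch for steps (1)-(4).
import cv2

cap = cv2.VideoCapture("rtmp://example.invalid/live/stream")  # hypothetical source
while True:
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    masks = candidate_masks(rgb)          # step (3): candidate masks for this frame
    if masks:
        # Step (2) would select the tracked target's mask by feature comparison;
        # taking the first candidate here is a placeholder for that logic.
        frame = focus_effect(frame, masks[0])  # step (4): mask-based fusion
    cv2.imshow("focus", frame)
    if cv2.waitKey(1) == 27:              # Esc stops playback
        break
cap.release()
```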
In the following, application scenarios of the embodiment of the present invention are described through specific examples. Figs. 2a to 2c show application scenarios one to three of the video image processing method, a video scene with several people playing. In fig. 2a, the person in the bounding box to which the user-selected coordinates belong (the person in the dashed frame in the figure) is determined as the target object. The mask generation model then produces the mask image of the target person shown in fig. 2b. After the whole original video frame is blurred and fused with the mask image, the special-effect video image focused on the target person shown in fig. 2c is obtained: only the target person remains sharp, while everything outside the target person is blurred, creating a focusing effect. The target object may also be several objects. Figs. 3a to 3c show application scenarios four to six: the several people in the dashed frame of fig. 3b, selected by the user or recognized automatically according to actual requirements, may be determined as target objects, producing the mask images of the several target objects shown in fig. 3c. After these mask images are fused with the blurred whole image, a special-effect image (not shown) focused on several people is formed, in which only the people remain sharp and the rest of the image is blurred, creating a multi-person focusing effect.
In addition, the embodiment of the present invention may also be applied to a scene in which an anchor broadcasts live video. Figs. 4 to 5 show application scenarios seven to eight. Here, person segmentation detection is performed on the video frame image to obtain the bounding box of the coordinates where a person is located, such as the rectangular frame surrounding the anchor shown in the figure. A target person is then determined through user selection, the mask of the target person is generated by the mask generation model, and finally a special-effect video image with a focusing effect on the anchor is generated from the mask image and the blurred whole image, as shown in the figure. Further, as shown in fig. 5, when there are several anchor persons in the video image, person segmentation detection similarly yields the bounding boxes of the coordinates of the several persons; the figure shows two anchors, each with a rectangular bounding box. The user selects a target person as needed, so one of the anchors is determined as the target person, the mask generation model produces that person's mask, and after blurring and fusion the special-effect video frame with a focusing effect on the target anchor shown in the figure is obtained.
The embodiment of the invention uses mask-based image fusion and image tracking to continuously update the image feature data of the target object, so that special-effect processing can be applied quickly within a video stream. This improves the processing efficiency of the video focusing effect, allows the user to specify the target object, and better meets user requirements.
The technical solution of the present invention is further illustrated by some specific examples.
Example one
As shown in fig. 6, which is a schematic flowchart of a video image processing method according to an embodiment of the present invention, the method includes the following steps:
s101: a target object in a video frame is detected. The target object can be determined by means of human selection or automatic identification of the system. In the case of adopting a manual selection manner, before step S101, the embodiment of the present invention may further include step S100: and in response to the selection operation of the object in the current video frame, taking the selected object as a target object, wherein the object can be a person or other objects.
In a certain video frame, the target object can be determined in a way selected by the user. Specifically, the object detection segmentation may be performed on the current video frame in advance to form a bounding box of each object. Then, by acquiring coordinates of a position selected by the user in the image, an enclosure frame to which the position coordinates belong is identified, and an object in the enclosure frame is taken as a target object.
In addition, a certain object or a certain type of object in the video image can be automatically identified by the system to select the target object. For example, the type of the target object (e.g., a person or a specific object) is set in advance, and then the target object is automatically selected by a technique such as image recognition.
Further, after the target object is determined in the current video frame, the step S101 is performed for the subsequent video frame: a target object in a video frame is detected.
Specifically, image feature extraction may be performed on the target object in the current video frame to generate first image feature data of the target object, and second image feature data of one or more objects may be extracted in a subsequent video frame. The similarity between each object in the subsequent video frame and the target object is calculated from the first and second image feature data, and if an object whose similarity is greater than a preset threshold exists, that object is determined to be the target object.
In the video stream, the feature data of the target object changes between video frames. Therefore, for each subsequent video frame, the image feature data of the objects of the same type as the target object is extracted; for convenience of distinction this is referred to as second image feature data. The second image feature data is compared with the first image feature data, the similarity between each object's image feature data and the first image feature data is calculated, and if an object whose similarity is greater than a preset threshold exists, that object is determined to be the target object. After the target object is determined, the first image feature data is updated to the image feature data of the target object for feature comparison in subsequent video frames.
S102: an image mask of the target object is generated.
Specifically, mask generation may be implemented by a mask generation model, as shown in fig. 1. Feature data of the whole image is extracted from the video frame image through an image feature extraction layer, and the extracted feature data is identified through an ROI recommendation layer to find the regions of interest, that is, regions where the target object may exist. Then, feature data of each region of interest is extracted from the feature data of the entire image. Since the sizes of the regions of interest may differ, to facilitate processing by subsequent layers the feature data of the candidate target regions may pass through the ROI alignment layer, which aligns the feature data of the regions of interest.
The aligned feature data of the region of interest is provided to a classification layer for classification and identification, and the output result of the classification layer is as follows: whether or not objects of the same category as the target object are contained in the region of interest. When the classification result includes an object of the same type as the target object, the bounding box calculation layer calculates the position coordinates of the bounding box of the target object and outputs the position coordinates to the mask calculation layer.
The mask calculation layer performs mask calculation according to the output results of the classification layer and the bounding box calculation layer, and outputs a mask. Specifically, if the classification result is that an object of the same category as the target object is included, the mask of that object is computed and output by the mask calculation layer; if the classification result is that no object of the same category as the target object is included, no mask (or an invalid mask) may be output.
It should be noted that the masks output by the model shown in fig. 1 are masks of one or more objects of the same type as the target object. Alternatively, the target object may first be determined outside the model by image feature comparison, and the mask of the target object may then be obtained directly, based on its image feature data, through the bounding box calculation layer and the mask calculation layer of the model (i.e., the processing of the classification layer may be omitted).
S103: and performing blurring processing on the background image of the target object according to the image mask to generate a new video frame and play the new video frame.
Further, the blurring process may specifically be as follows: first, the whole subsequent video frame is blurred, and then the video frame before blurring and the video frame after blurring are fused using the image mask to generate a new special-effect video frame with a focus-like effect. Because the image fusion uses the image mask, in the special-effect video frame the target object part retains the image of the video frame before blurring, while the parts other than the target object retain the image of the blurred video frame. The blurring of the subsequent video frames can be performed with a Gaussian blur algorithm to generate the blurred video frames.
In the process of live video broadcast, after an object is selected in some video frame, this special-effect generation is executed in a loop over the subsequent frames, so that the played video has a focusing effect. In addition, for the convenience of the user, a control for switching can be arranged on the application interface, so that playback switches between the special-effect video frames and the original video frames in response to a user operation instruction. Moreover, the method is not only suitable for video processing in a live scene but can also apply the focusing special effect to general videos.
The embodiment of the invention uses mask-based image fusion and image tracking to continuously update the image feature data of the target object, so that special-effect processing can be applied quickly within a video stream. This improves the processing efficiency of the video focusing effect, allows the user to specify the target object, and thus better meets user requirements.
Example two
As shown in fig. 7, which is a schematic structural diagram of a video image processing apparatus according to an embodiment of the present invention, the apparatus includes:
target object detection module 21: used for detecting a target object in a video frame. The target object can be determined by manual selection or by automatic identification by the system. Manual selection can be realized by a target object selection module 20 included in the apparatus; specifically, the target object selection module 20 is configured to, before image feature extraction, take the selected object as the target object in response to a selection operation on an object in the current video frame.
In a certain video frame, the target object can be determined in a way selected by the user. Specifically, the object detection segmentation may be performed on the current video frame in advance to form a bounding box of each object. Then, by acquiring coordinates of a position selected by the user in the image, an enclosure frame to which the position coordinates belong is identified, and an object in the enclosure frame is taken as a target object.
In addition, a certain object or a certain type of object in the video image can be automatically identified by the system to select the target object. For example, the type of the target object (e.g., a person or a specific object) is set in advance, and then the target object is automatically selected by a technique such as image recognition.
Further, after the target object is determined in the current video frame, the target object detection module 21 detects the target object in the subsequent video frames. Specifically, image feature extraction may be performed on the target object in the current video frame to generate first image feature data of the target object, and second image feature data of one or more objects may be extracted in a subsequent video frame. The similarity between each object in the subsequent video frame and the target object is calculated from the first and second image feature data, and if an object whose similarity is greater than a preset threshold exists, that object is determined to be the target object.
In the video stream, the feature data of the target object changes between video frames. Therefore, for each subsequent video frame, the image feature data of the objects of the same type as the target object is extracted; for convenience of distinction this is referred to as second image feature data. The second image feature data is compared with the first image feature data, the similarity between each object's image feature data and the first image feature data is calculated, and if an object whose similarity is greater than a preset threshold exists, that object is determined to be the target object. After the target object is determined, the first image feature data is updated to the image feature data of the target object for feature comparison in subsequent video frames.
The mask generation module 22: for generating an image mask of the target object.
Specifically, mask generation may be implemented by a mask generation model. Feature data of the whole image is extracted from the video frame image through an image feature extraction layer, and the extracted feature data is identified through an ROI recommendation layer to find the regions of interest, that is, regions where the target object may exist. Then, feature data of each region of interest is extracted from the feature data of the entire image. Since the sizes of the regions of interest may differ, to facilitate processing by subsequent layers the feature data of the candidate target regions may pass through the ROI alignment layer, which aligns the feature data of the regions of interest.
The aligned feature data of the region of interest is provided to a classification layer for classification and identification, and the output result of the classification layer is as follows: whether or not objects of the same category as the target object are contained in the region of interest. When the classification result includes an object of the same type as the target object, the bounding box calculation layer calculates the position coordinates of the bounding box of the target object and outputs the position coordinates to the mask calculation layer.
The mask calculation layer performs mask calculation according to the output results of the classification layer and the bounding box calculation layer, and outputs a mask. Specifically, if the classification result is that an object of the same category as the target object is included, the mask of that object is computed and output by the mask calculation layer; if the classification result is that no object of the same category as the target object is included, no mask (or an invalid mask) may be output.
The blurring processing module 23: used for blurring the background image of the target object according to the image mask to generate a new video frame and playing the new video frame.
Specifically, the subsequent video frames are blurred, and the video frame before blurring and the video frame after blurring are fused using the image mask to generate a new special-effect video frame with a focus-like effect. In the special-effect video frame, the target object part retains the image of the video frame before blurring, while the parts other than the target object retain the image of the blurred video frame.
The blurring of the subsequent video frames can be performed with a Gaussian blur algorithm. In the process of live video broadcast, after an object is selected in some video frame, this special-effect generation is executed in a loop over the subsequent frames, so that the played video has a focusing effect. In addition, for the convenience of the user, the apparatus may further include a play switching module to switch playback between the special-effect video frames and the original video frames in response to a user operation instruction.
The embodiment of the invention uses mask-based image fusion and image tracking to continuously update the image feature data of the target object, so that special-effect processing can be applied quickly within a video stream. This improves the processing efficiency of the video focusing effect, allows the user to specify the target object, and better meets user requirements.
EXAMPLE III
The embodiment of the invention also provides a video processing method in live broadcast, which can be applied to a video live broadcast platform and specifically comprises the following steps:
s301: in the current video frame of the live video, the selected object is determined as the target object in response to the selection operation of the user.
In a live video scene, in a certain video frame, a target object can be determined in a mode selected by a user. Specifically, the object detection segmentation may be performed on the current video frame in advance to form a bounding box of each object. Then, by acquiring coordinates of a position selected by the user in the image, an enclosure frame to which the position coordinates belong is identified, and an object in the enclosure frame is taken as a target object.
In addition, a certain object or a certain type of object in the video image can be automatically identified by the system to select the target object. For example, the type of the target object (e.g., a person or a specific object) is set in advance, and then the target object is automatically selected by a technique such as image recognition.
S302: in subsequent video frames, a target object is detected.
Further, after the target object is determined in the current video frame, image feature extraction is performed on the target object in the current video frame to generate first image feature data of the target object, and second image feature data of one or more objects is extracted in a subsequent video frame. The similarity between each object in the subsequent video frame and the target object is calculated from the first and second image feature data, and if an object whose similarity is greater than a preset threshold exists, that object is determined to be the target object.
In the video stream, the feature data of the target object changes between video frames. Therefore, for each subsequent video frame, the image feature data of the objects of the same type as the target object is extracted; for convenience of distinction this is referred to as second image feature data. The second image feature data is compared with the first image feature data, the similarity between each object's image feature data and the first image feature data is calculated, and if an object whose similarity is greater than a preset threshold exists, that object is determined to be the target object. After the target object is determined, the first image feature data is updated to the image feature data of the target object for feature comparison in subsequent video frames.
S303: an image mask of the target object is generated.
S304: and blurring the background image of the target object according to the image mask to generate a new live broadcast video and playing the new live broadcast video.
Specifically, in the blurred video image only the target object remains sharp and the parts other than the target object are blurred, so a new special-effect video frame with a focus-like effect is generated and played in real time.
In addition, for the convenience of the user, a control for switching can be arranged on the live video application interface, so that playback switches between the new video frames and the original video frames in response to a user operation instruction.
The embodiment of the invention uses mask-based image fusion and image tracking to continuously update the image feature data of the target object, so that special-effect processing can be applied quickly within the live video stream. This improves the processing efficiency of the video focusing effect and meets the real-time processing requirements of live video. In addition, the user can be allowed to flexibly designate the target object, which better meets user requirements.
Example four
The embodiment of the invention also provides a video processing method in live broadcast, which can be applied to a video live broadcast platform and specifically comprises the following steps:
and S401, in the current video frame of the live video, responding to the selection operation of the user, and determining the selected object as the target object.
In a live video scene, in a certain video frame, a target object can be determined in a mode selected by a user. In addition, a certain object or a certain type of object in the video image can be automatically identified by the system to select the target object. For example, in a specific live video scene, a main player shows and introduces a commodity for sale in a video, in the process, a main player character or a commodity being introduced can be determined as a target object through selection of a user or according to different contents introduced in the live broadcast, for example, when the main player introduces the specific commodity, the commodity can be correspondingly taken as the target object, so that the commodity is highlighted in a subsequent processing process to be more intuitively and deeply shown for a buyer user, and the purchase desire is stimulated. In the process, all the items (including the anchor) except the commodity in the video are treated as background. After the exhibition introduction of the commodity is finished, the target object can be converted into an anchor character so as to highlight the interactivity of the live video scene.
S402: in subsequent video frames, a target object is detected.
Further, after the target object is determined in the current video frame, image feature extraction is performed on the target object in the current video frame to generate first image feature data of the target object, and second image feature data of one or more objects is extracted in a subsequent video frame. The similarity between each object in the subsequent video frame and the target object is calculated from the first and second image feature data, and if an object whose similarity is greater than a preset threshold exists, that object is determined to be the target object.
In the video stream, the feature data of the target object changes between video frames. Therefore, for each subsequent video frame, the image feature data of the objects of the same type as the target object is extracted; for convenience of distinction this is referred to as second image feature data. The second image feature data is compared with the first image feature data, the similarity between each object's image feature data and the first image feature data is calculated, and if an object whose similarity is greater than a preset threshold exists, that object is determined to be the target object. After the target object is determined, the first image feature data is updated to the image feature data of the target object for feature comparison in subsequent video frames.
S403: an image mask of the target object is generated.
S404: and processing the target object or the background image except the target object according to the image mask, distinguishing the target object from the background image except the target object, and generating and playing a new live video.
Specifically, the target object and the background image other than the target object may be processed in different ways so as to distinguish them. The target object may be emphasized, for example by highlighting the image of the target object, while the background image may be converted to gray scale; alternatively, the background image may be blurred to achieve a focusing effect on the target object.
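As a sketch of the gray-scale variant described above (the blurring variant works the same way with focus_effect from the earlier sketch); the OpenCV color conversions are an illustrative choice, since the patent does not specify a library:

```python
# Distinguish the target from the background: keep the target object in
# color and convert everything else to gray scale.
import cv2
import numpy as np

def gray_background(frame: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """frame: HxWx3 uint8 image; mask: HxW uint8 with 255 on the target object."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray3 = cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR)  # back to 3 channels for compositing
    m = (mask > 0)[..., None]
    return np.where(m, frame, gray3)                # target in color, background gray
```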
In order to facilitate the use of the user, a key for the user to perform switching operation may be arranged on the video live application interface, so as to implement playback switching between a new video frame and an original video frame in response to a user operation instruction.
In addition, in an actual scene, this special video image processing may be provided as a privilege to VIP users, for example VIP merchant users and/or VIP buyer users. A VIP merchant user may display goods as the focused target object during the live broadcast according to the needs of goods display, while a VIP buyer user may select a target object according to specific needs and watch the live video with a focusing effect on that object.
The embodiment of the invention uses mask-based image fusion and image tracking to continuously update the image feature data of the target object, so that special-effect processing can be applied quickly within the live video stream. This improves the processing efficiency of the video focusing effect and meets the real-time processing requirements of live video. In addition, the user can be allowed to flexibly designate the target object, which better meets user requirements.
EXAMPLE five
The foregoing embodiments describe the processing flow and the device structure according to the embodiments of the present invention, and the functions of the method and the device can be implemented by an electronic device. Fig. 8 is a schematic structural diagram of the electronic device according to an embodiment of the present invention, which specifically includes: a memory 110 and a processor 120.
And a memory 110 for storing a program.
In addition to the programs described above, the memory 110 may also be configured to store other various data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, and so forth.
The memory 110 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
A processor 120, coupled to the memory 110, executes the program in the memory 110 to perform the operation steps of the video image processing method described in the foregoing embodiments.
Further, the processor 120 may also include various modules described in the foregoing embodiments to perform video image processing, and the memory 110 may be used, for example, to store data required for the modules to perform operations and/or output data.
The processing procedure, technical principles, and technical effects are described in detail in the foregoing embodiments and are not repeated here.
Further, as shown, the electronic device may further include: communication components 130, power components 140, audio components 150, display 160, and other components. Only some of the components are schematically shown in the figure and it is not meant that the electronic device comprises only the components shown in the figure.
The communication component 130 is configured to facilitate wired or wireless communication between the electronic device and other devices. The electronic device may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 130 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 130 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
The power supply component 140 provides power to the various components of the electronic device. The power components 140 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for an electronic device.
The audio component 150 is configured to output and/or input audio signals. For example, the audio component 150 includes a Microphone (MIC) configured to receive external audio signals when the electronic device is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 110 or transmitted via the communication component 130. In some embodiments, audio assembly 150 also includes a speaker for outputting audio signals.
The display 160 includes a screen, which may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (20)

1. A video image processing method, comprising:
detecting a target object in a video frame;
generating an image mask of the target object;
and performing blurring processing on the background image of the target object according to the image mask to generate a new video frame and play the new video frame.
2. The method of claim 1, further comprising:
and in response to a selection operation of an object in the current video frame, taking the selected object as the target object.
3. The method of claim 2, wherein taking the selected object as the target object in response to the selection operation on the object in the current video frame comprises:
performing object detection and segmentation on the current video frame to form a bounding box for each object;
and acquiring the position coordinates selected by the user, identifying the bounding box to which the coordinates belong, and taking the object in that bounding box as the target object.
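As a hedged sketch of this selection step (the (x1, y1, x2, y2) box format and the smallest-box tie-break for overlapping boxes are assumptions, not the patent's specification):

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # assumed (x1, y1, x2, y2) corner format

def pick_target(boxes: List[Box], x: int, y: int) -> Optional[int]:
    """Return the index of the bounding box that contains the user's tap
    coordinates; among overlapping boxes, prefer the smallest, which is
    usually the frontmost object. None means the tap hit no box."""
    hits = [(i, (bx2 - bx1) * (by2 - by1))
            for i, (bx1, by1, bx2, by2) in enumerate(boxes)
            if bx1 <= x <= bx2 and by1 <= y <= by2]
    if not hits:
        return None
    return min(hits, key=lambda h: h[1])[0]
```

Wired to a touch event from the display 160, `pick_target(boxes, tap_x, tap_y)` would map a tap to the claim-2 target object.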
4. The method of claim 1, wherein detecting a target object in a video frame comprises:
extracting image features of the target object in the current video frame to generate first image feature data of the target object; and
extracting second image feature data of one or more objects in a subsequent video frame, calculating a similarity between each object in the subsequent video frame and the target object according to the first image feature data and the second image feature data, and, if an object whose similarity exceeds a preset threshold exists, determining that object to be the target object.
5. The method of claim 4, further comprising: after the target object is determined, updating the first image feature data of the target object.
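Claims 4 and 5 together describe a track-by-similarity loop: match by feature similarity, then refresh the stored features. Below is a minimal sketch of one such loop; the cosine similarity measure and the 0.7 threshold are assumptions, since the claims only require some similarity measure and "a preset threshold":

```python
import numpy as np

def match_target(target_feat: np.ndarray, candidate_feats: list,
                 threshold: float = 0.7):
    """Compare the stored first image feature data of the target object
    against the second image feature data of each object detected in the
    next frame; the best match above the threshold becomes the target, and
    (per claim 5) its features replace the stored ones so the tracker
    adapts as the object's appearance changes."""
    def cosine(a, b):
        return float(np.dot(a, b) /
                     (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    scores = [cosine(target_feat, f) for f in candidate_feats]
    if not scores:
        return -1, target_feat              # nothing detected; keep old features
    best = int(np.argmax(scores))
    if scores[best] > threshold:
        return best, candidate_feats[best]  # target index, updated feature data
    return -1, target_feat                  # target lost in this frame
```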
6. The method of claim 1, wherein blurring the background image of the target object according to the image mask comprises:
blurring the subsequent video frame;
and fusing the video frame before blurring and the video frame after blurring by using the image mask to generate a new video frame, wherein, in the new video frame, the target object region retains the image of the video frame before blurring, and the region other than the target object retains the image of the video frame after blurring.
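The fusion here is, in effect, a per-pixel blend: keep the sharp pixel where the mask marks the target object and the blurred pixel elsewhere. A minimal sketch, assuming OpenCV and an H x W mask with 1 on the object; the edge feathering is a cosmetic assumption, not part of the claim:

```python
import cv2
import numpy as np

def blur_background(frame: np.ndarray, mask: np.ndarray,
                    ksize: int = 31) -> np.ndarray:
    """Blur a copy of the whole frame, then use the image mask to retain
    the pre-blurring pixels inside the target object and the post-blurring
    pixels everywhere else."""
    blurred = cv2.GaussianBlur(frame, (ksize, ksize), 0)
    # Feathering the mask edge (an assumption, not claimed) avoids a hard
    # cut-out border around the object.
    soft = cv2.GaussianBlur(mask.astype(np.float32), (15, 15), 0)[..., None]
    fused = soft * frame.astype(np.float32) + \
            (1.0 - soft) * blurred.astype(np.float32)
    return fused.astype(np.uint8)
```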
7. The method of claim 1, further comprising:
and in response to a user operation instruction, switching playback between the new video frame and the original video frame.
8. The method of claim 1, wherein the object is a human figure.
9. A video image processing apparatus comprising:
a target object detection module, configured to detect a target object in a video frame;
a mask generation module, configured to generate an image mask of the target object;
and a blurring processing module, configured to blur the background image of the target object according to the image mask to generate a new video frame and play the new video frame.
10. The apparatus of claim 9, further comprising:
a target object selection module, configured to, in response to a selection operation on an object in a current video frame, take the selected object as the target object.
11. The apparatus of claim 10, wherein taking the selected object as the target object in response to the selection operation on the object in the current video frame comprises:
performing object detection and segmentation on the current video frame to form a bounding box for each object;
and acquiring the position coordinates selected by the user, identifying the bounding box to which the coordinates belong, and taking the object in that bounding box as the target object.
12. The apparatus of claim 9, wherein detecting a target object in a video frame comprises:
extracting image features of the target object in the current video frame to generate first image feature data of the target object; and
extracting second image feature data of one or more objects in a subsequent video frame, calculating a similarity between each object in the subsequent video frame and the target object according to the first image feature data and the second image feature data, and, if an object whose similarity exceeds a preset threshold exists, determining that object to be the target object.
13. The apparatus of claim 12, wherein detecting a target object in a video frame further comprises:
after the target object is determined, updating the first image feature data of the target object.
14. The apparatus of claim 9, wherein blurring the background image of the target object according to the image mask comprises:
blurring the subsequent video frame;
and fusing the video frame before blurring and the video frame after blurring by using the image mask to generate a new video frame, wherein, in the new video frame, the target object region retains the image of the video frame before blurring, and the region other than the target object retains the image of the video frame after blurring.
15. An electronic device, comprising:
a memory for storing a program;
a processor for executing the program stored in the memory to perform the video image processing method of any one of claims 1 to 8.
16. A video image processing method, comprising:
in a current video frame of a live video, in response to a selection operation by a user, determining the selected object as a target object;
detecting the target object in a subsequent video frame;
generating an image mask of the target object;
and blurring the background image of the target object according to the image mask to generate a special-effect live video, and playing the special-effect live video.
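Putting the steps of claim 16 into a per-frame loop might look like the sketch below. `detect_target` is a hypothetical stand-in for any detector/tracker that returns per-pixel foreground probabilities for the selected object, and `object_mask` and `blur_background` are the hypothetical helpers sketched after claims 1 and 6:

```python
import cv2

def run_live_effect(detect_target, source=0):
    """Read live frames, track the previously selected target object, and
    show each frame with its background blurred, i.e. the special-effect
    live video of claim 16."""
    cap = cv2.VideoCapture(source)
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            seg_probs = detect_target(frame)   # hypothetical detection step
            out = blur_background(frame, object_mask(seg_probs))
            cv2.imshow("special-effect live video", out)
            if cv2.waitKey(1) & 0xFF == ord("q"):
                break
    finally:
        cap.release()
        cv2.destroyAllWindows()
```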
17. The method of claim 16, wherein detecting the target object in the subsequent video frame comprises:
extracting image features of the target object in the current video frame to generate first image feature data of the target object; and
extracting second image feature data of one or more objects in the subsequent video frame, calculating a similarity between each object in the subsequent video frame and the target object according to the first image feature data and the second image feature data, determining an object whose similarity exceeds a preset threshold to be the target object, and updating the first image feature data of the target object.
18. An electronic device, comprising:
a memory for storing a program;
a processor for executing the program stored in the memory to perform the video image processing method of claim 16 or 17.
19. A video image processing method, comprising:
in a current video frame of a live video, in response to a selection operation by a user, determining the selected object as a target object;
detecting the target object in a subsequent video frame;
generating an image mask of the target object;
and processing the target object and/or the background image other than the target object according to the image mask so as to distinguish the target object from the background image, and generating and playing a new live video.
20. An electronic device, comprising:
a memory for storing a program;
a processor for executing the program stored in the memory to perform the video image processing method of claim 19.
CN201910631228.3A 2019-07-12 2019-07-12 Video image processing method and device and electronic equipment Pending CN112215762A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910631228.3A CN112215762A (en) 2019-07-12 2019-07-12 Video image processing method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910631228.3A CN112215762A (en) 2019-07-12 2019-07-12 Video image processing method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN112215762A true CN112215762A (en) 2021-01-12

Family

ID=74047929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910631228.3A Pending CN112215762A (en) 2019-07-12 2019-07-12 Video image processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN112215762A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114630057A (en) * 2022-03-11 2022-06-14 北京字跳网络技术有限公司 Method and device for determining special effect video, electronic equipment and storage medium
CN114630057B (en) * 2022-03-11 2024-01-30 北京字跳网络技术有限公司 Method and device for determining special effect video, electronic equipment and storage medium
CN117593211A (en) * 2023-12-15 2024-02-23 书行科技(北京)有限公司 Video processing method, device, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination