
Video object segmentation method and device and computer equipment

Info

Publication number
CN113066092A
Authority
CN
China
Prior art keywords
mask image
video frame
segmentation
calibration
video
Prior art date
Legal status
Granted
Application number
CN202110340973.XA
Other languages
Chinese (zh)
Other versions
CN113066092B (en)
Inventor
袁瑞峰 (Yuan Ruifeng)
周梦涵 (Zhou Menghan)
鲍海明 (Bao Haiming)
Current Assignee
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date
Filing date
Publication date
Application filed by Lenovo Beijing Ltd
Priority to CN202110340973.XA
Publication of CN113066092A
Application granted
Publication of CN113066092B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/11 Region-based segmentation
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a video object segmentation method and apparatus, and a computer device. After a first video frame and the historical mask image of the adjacent historical video frame are acquired, object segmentation processing is performed in a first segmentation mode using the historical mask image and the first video frame to obtain a first mask image of the first video frame, and object segmentation processing is performed in a second segmentation mode using only the first video frame to obtain a calibration mask image of the first video frame, which guarantees that the calibration mask image is not influenced by the historical mask image. When the comparison result of the calibration mask image and the first mask image satisfies a video segmentation calibration condition, the first mask image can be determined to be inaccurate, and the calibration mask image is output directly as the target mask image of the first video frame. This avoids the technical problem that directly outputting the first mask image causes the video object segmentation effect to grow worse and worse when, for example, the video object moves rapidly.

Description

Video object segmentation method and device and computer equipment
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a method and an apparatus for segmenting a video object, and a computer device.
Background
With the development of artificial intelligence and image processing technology, artificial intelligence methods are commonly applied to video object segmentation. Specifically, the mask image output for the previous frame is used to guide the object segmentation processing of the current video frame, so that the object region is segmented from the video quickly and accurately and subsequent image processing requirements are met.
However, in practical applications, taking a video conference as an example, if a user participating in the video conference moves quickly, the object segmentation effect of the mask image output for the previous frame becomes unstable, which in turn disturbs the object segmentation result of the current video frame. The object segmentation result of each subsequent frame then grows worse and worse, and the segmented region may even drag in other objects in the video conference, such as tables and chairs.
Therefore, in a dynamic video object segmentation scene, how to segment video objects accurately and ensure the reliability of the output video image has become an urgent problem for those skilled in the art.
Disclosure of Invention
In view of this, the present application provides the following technical solutions:
in one aspect, the present application provides a video object segmentation method, including:
acquiring a first video frame and a historical mask image of a historical video frame adjacent to the first video frame;
according to a first segmentation mode, performing object segmentation processing by using the historical mask image and the first video frame to obtain a first mask image of the first video frame;
according to a second segmentation mode, carrying out object segmentation processing by using the first video frame to obtain a calibration mask image of the first video frame;
and under the condition that the comparison result of the calibration mask image and the first mask image meets the video segmentation calibration condition, outputting the calibration mask image as a target mask image of the first video frame.
In some embodiments, the comparison of the first mask image to the calibration mask image comprises:
acquiring first attribute information of the first mask image and second attribute information of the calibration mask image;
comparing the first attribute information with the second attribute information to obtain attribute differences between respective object segmentation areas of the first mask image and the calibration mask image;
the comparison of the calibration mask image and the first mask image satisfies a video segmentation calibration condition, including:
the attribute difference reaches a video segmentation calibration threshold.
In some embodiments, the obtaining first attribute information of the first mask image and the second attribute information of the calibration mask image comprises:
respectively carrying out pixel statistics on object segmentation regions respectively contained in the first mask image and the calibration mask image to obtain a first region area of the object segmentation region in the first mask image and a second region area of the object segmentation region in the calibration mask image;
the comparing the first attribute information and the second attribute information to obtain an attribute difference between object segmentation regions of the first mask image and the calibration mask image, includes:
and performing difference operation on the first region area and the second region area to obtain a region area difference between the respective object segmentation regions of the first mask image and the calibration mask image.
In some embodiments, the method further comprises:
acquiring segmentation state information of the first video frame;
and under the condition that the segmentation state information meets the video calibration condition, executing the step of performing object segmentation processing by using the first video frame according to a second segmentation mode to obtain a calibration mask image of the first video frame.
In some embodiments, the segmentation status information satisfying the video calibration condition comprises:
acquiring a calibration time interval corresponding to the first video frame, and determining that the calibration time interval reaches a video calibration time interval threshold; wherein, the calibration time interval refers to the time interval between the acquisition time point of the first video frame and the last video calibration time point;
or,
acquiring the frame number interval between the first video frame and the video frame of the last video calibration, and determining that the frame number interval reaches the threshold of the video calibration frame number interval.
In some embodiments, the method further comprises:
and in the case that the segmentation state information does not meet the video calibration condition, or in the case that the comparison result of the calibration mask image and the first mask image does not meet the video segmentation calibration condition, outputting the first mask image as a target mask image of the first video frame.
In some embodiments, the performing, according to a first segmentation mode, object segmentation processing by using the historical mask image and the first video frame to obtain a first mask image of the first video frame includes:
inputting the historical mask image and the first video frame into a first segmentation model, and outputting a first mask image of the first video frame;
the performing object segmentation processing on the first video frame according to an image segmentation mode to obtain a calibration mask image of the first video frame includes:
inputting the first video frame into a second segmentation model, and outputting a calibration mask image of the first video frame; or,
setting the historical mask image to zero to obtain a target historical mask image;
inputting the target historical mask image and the first video frame into a third segmentation model, and outputting a calibration mask image of the first video frame.
In some embodiments, in the case that the comparison of the calibration mask image and the first mask image satisfies the video segmentation calibration condition, the method further comprises:
and adjusting the model parameters of the first segmentation model according to the comparison result of the calibration mask image and the first video frame, so as to continue to perform object segmentation processing on the next video frame by using the adjusted first segmentation model.
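As one illustration of this online adjustment, the sketch below treats the calibration mask image as a pseudo-label for the first video frame and takes a single gradient step on the first segmentation model; the PyTorch-style model, optimizer, tensor shapes and function names are assumptions made for illustration, not the patent's prescribed implementation.

```python
import torch
import torch.nn.functional as F

def adjust_first_model(model: torch.nn.Module,
                       optimizer: torch.optim.Optimizer,
                       four_channel_input: torch.Tensor,   # 1 x 4 x H x W: history mask + RGB
                       calib_mask: torch.Tensor) -> float: # 1 x 1 x H x W, values in [0, 1]
    """One online adaptation step: nudge the first segmentation model toward the
    calibration mask before it is used to segment the next video frame."""
    model.train()
    optimizer.zero_grad()
    logits = model(four_channel_input)   # predicted mask logits, 1 x 1 x H x W
    loss = F.binary_cross_entropy_with_logits(logits, calib_mask)
    loss.backward()
    optimizer.step()
    return float(loss.item())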
In another aspect, the present application further provides a video object segmentation apparatus, including:
the acquisition module is used for acquiring a first video frame and a historical mask image of a historical video frame adjacent to the first video frame;
the first segmentation processing module is used for performing object segmentation processing by using the historical mask image and the first video frame according to a first segmentation mode to obtain a first mask image of the first video frame;
the second segmentation processing module is used for performing object segmentation processing by using the first video frame according to a second segmentation mode to obtain a calibration mask image of the first video frame;
and the target mask image output module is used for outputting the calibration mask image as the target mask image of the first video frame under the condition that the comparison result of the calibration mask image and the first mask image meets the video segmentation calibration condition.
In yet another aspect, the present application further proposes a computer device, comprising:
a memory for storing a program for implementing the video object segmentation method as described above;
and the processor is used for loading and executing the program stored in the memory to realize the steps of the video object segmentation method.
Therefore, after the first video frame and the historical mask image of the adjacent historical video frame are acquired, object segmentation processing is performed in the first segmentation mode using the historical mask image and the first video frame to obtain the first mask image of the first video frame. To prevent an inaccurate and unreliable first mask image from degrading the object segmentation effect of the next video frame under its guidance, the application also performs object segmentation processing in the second segmentation mode using only the first video frame to obtain the calibration mask image of the first video frame, so that the acquisition of the calibration mask image does not depend on the historical mask image of the previous video frame and is therefore not adversely affected by it. The calibration mask image is then compared with the first mask image; when the comparison result satisfies the video segmentation calibration condition, the first mask image is considered inaccurate, object segmentation of the next video frame is no longer guided by it, and the calibration mask image is output directly as the target mask image of the first video frame, thereby guaranteeing the object segmentation effect of the next video frame guided by the target mask image.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only embodiments of the present application; for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.
Fig. 1 is a schematic flowchart of an alternative example of a video object segmentation method proposed in the present application;
fig. 2 is a schematic flow chart of still another alternative example of the video object segmentation method proposed in the present application;
fig. 3 is a schematic flowchart of another alternative example of the video object segmentation method proposed in the present application;
fig. 4 is a schematic flowchart of another alternative example of the video object segmentation method proposed in the present application;
fig. 5 is a schematic flowchart of yet another alternative example of the video object segmentation method proposed in the present application;
fig. 6 is a schematic structural diagram of an alternative example of the video object segmentation apparatus proposed in the present application;
FIG. 7 is a diagram illustrating a hardware configuration of an alternative example of a computer device suitable for use in the video object segmentation method and apparatus proposed in the present application;
fig. 8 is a schematic diagram of a video conference scene suitable for the video object segmentation method and apparatus proposed in the present application.
Detailed Description
Aiming at the technical problems described in the Background section, the present application proposes that after the mask image of the current video frame is obtained by performing object segmentation with the mask image output for the previous frame and the current video frame, the reliability of that mask image is verified to determine whether it should be used when performing object segmentation on the next video frame. In this way, when the mask image of the current video frame is verified to be unreliable, it can be calibrated or cleared in time, which avoids the problem that an unreliable mask image keeps guiding the object segmentation of subsequent video frames, making their segmentation effect worse and worse and the output video frames abnormal.
For the reliability verification process, the application proposes to perform object segmentation processing using only the current video frame and to take the resulting mask image as a calibration mask image. The calibration mask image is then compared with the mask image of the current video frame obtained in the above manner; if the comparison result satisfies the video segmentation calibration condition, the mask image obtained in the above manner can be considered unreliable, and the application outputs the calibration mask image as the target mask image of the current video frame to guide the object segmentation processing of the next video frame. This ensures the reliability and accuracy of the segmentation result of the next video frame, so that the output video frames remain stable and reliable when a fast-moving object appears in a scene such as a video conference.
The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the drawings in the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of them; for convenience of description, the drawings show only the parts related to the invention, and the embodiments and their features can be combined with each other as long as they do not conflict. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without creative effort fall within the protection scope of the present application.
It should be understood that "system", "apparatus", "unit" and/or "module" as used herein are ways of distinguishing different components, elements, parts or assemblies at different levels. However, other words may be substituted if they accomplish the same purpose.
As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; the steps and elements do not form an exclusive list, and a method or apparatus may also include other steps or elements. An element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
In the description of the embodiments herein, "/" means "or" unless otherwise specified; for example, A/B may mean A or B. "And/or" herein merely describes an association between associated objects and means that three relationships are possible; e.g., A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, in the description of the embodiments of the present application, "a plurality" means two or more. The terms "first", "second" and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature.
Additionally, flow charts are used herein to illustrate operations performed by systems according to embodiments of the present application. It should be understood that the preceding or following operations are not necessarily performed in the exact order in which they are performed. Rather, the various steps may be processed in reverse order or simultaneously. Meanwhile, other operations may be added to the processes, or a certain step or several steps of operations may be removed from the processes.
Referring to fig. 1, a schematic flowchart of an optional example of the video object segmentation method proposed in the present application is shown, where the method may be applied to a computer device, such as various terminal devices or service devices, and the present application does not limit the product type of the computer device, which may be determined as the case may be. Moreover, the present application is applicable to a scene such as a video conference, where multiple frames of images need to be output continuously, and as shown in fig. 1, the video object segmentation method executed in the scene may include, but is not limited to, the following steps:
step S11, acquiring a first video frame and a historical mask image of a historical video frame adjacent to the first video frame;
In this embodiment of the present application, the first video frame may be any video frame obtained by the computer device. In different video output scenes, the computer device may receive video stream data collected and sent by a video source (i.e., a terminal device that collects video images) and, at any given time, treat the currently received video frame as the first video frame. Object segmentation processing is then performed on the first video frame according to the video object segmentation manner provided in this embodiment, accurately identifying the foreground object in the first video frame, so that the same foreground object can be tracked and identified across consecutive video frames and the processing requirements of the application scene on the foreground object or background image can be met. In this case, the computer device may be a service device that relays video stream data between different terminal devices, such as a communication server supporting a video conference in a video conference scenario.
In still other embodiments, the computer device may itself be the video source; that is, the computer device directly performs image capture, uses the currently captured video frame as the first video frame, and executes the video object segmentation method proposed in this embodiment to calibrate the directly captured first video frame, ensuring that the content of each video frame output by the computer device is clear and reliable. Thus, in a video conference scene, each computer device participating in the video conference can accurately identify the foreground object (e.g., a user participating in the video conference) in each captured video frame in this manner, which allows the user to replace the background image in the video frame while the identified foreground object remains unchanged. In other words, on the basis of meeting the participation requirements of the video conference, the environment where the user is currently located can be kept private. The application is not limited to the application requirements described in this embodiment, which are not detailed further herein.
It can be seen that, in different application scenarios, the product category of the computer device may differ, as may the specific way in which it acquires the first video frame, including but not limited to receiving a first video frame collected and sent by another terminal device as described above, or directly capturing images to obtain the first video frame.
Because a video is a dynamic image formed by a plurality of continuous frames, segmenting a target object from continuously played video frames places higher demands on the object segmentation technique than segmenting a static frame image. To ensure segmentation precision and efficiency, when performing object segmentation on a video frame the present application adopts an online adaptive video object segmentation method: the object segmentation result of the previous video frame and the current video frame itself are used to obtain the object segmentation result of the current video frame, which in turn guides the object segmentation of the next video frame.
In practical applications, if the target object in a video frame needs to be extracted, the mask image of the target object may be multiplied by the video frame to obtain the target object region: image values inside the target object region remain unchanged, and image values outside the region become 0. If an object in a video frame needs to be occluded instead, the mask image of the object may be used to mask the object on the video frame so that the object does not participate in subsequent processing, or so that only the object is processed.
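The following minimal NumPy sketch illustrates these two uses of a mask image; the function names and the binary H x W mask layout are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def extract_object(frame: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Multiply the frame by the mask: pixel values inside the target object
    region stay unchanged, values outside the region become 0."""
    return frame * mask[..., None]        # broadcast the H x W mask over 3 channels

def occlude_object(frame: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Zero out the object region so it does not take part in later processing."""
    return frame * (1 - mask)[..., None]
```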
Therefore, in order to realize accurate identification of the foreground object of the first video frame, a mask image of the foreground object output by a previous video frame of the first video frame, that is, a history mask image of a history video frame adjacent to the first video frame, may be obtained.
Step S12, according to a first segmentation mode, carrying out object segmentation processing by using the historical mask image and the first video frame to obtain a first mask image of the first video frame;
in this embodiment of the present application, the first segmentation method may include, but is not limited to, the above-described online adaptive video object segmentation method, and specifically, the first segmentation method may be implemented by selecting a suitable machine learning algorithm or a suitable deep learning algorithm in artificial intelligence according to an application requirement.
Step S13, according to a second segmentation mode, carrying out object segmentation processing by using the first video frame to obtain a calibration mask image of the first video frame;
In practical applications, if an object in the video shooting environment moves quickly, is occluded, rotates, and so on, the first mask image of the foreground object obtained in the first segmentation mode is often unreliable; that is, the object region contained in the first mask image may not cover the complete target object, or may contain other objects or part of the background in addition to the target object. In such cases, if the first mask image continues to guide the object segmentation of the next video frame, the segmentation results will grow worse and worse.
To solve the above problem, the present application proposes to verify the reliability of the first mask image. When determining a reference image for this reliability verification, an embodiment of the present application adopts an object segmentation method for still images to process the image of the first video frame; that is, according to the second segmentation mode, object segmentation is performed using only the image content contained in the first video frame itself to obtain a mask image of the foreground object, which serves as the calibration mask image of the first video frame for verifying or calibrating the first mask image obtained in the first segmentation mode.
The second segmentation mode may also be implemented with a deep learning or machine learning algorithm. Unlike the first segmentation mode, however, the second segmentation mode does not refer to the historical mask image of the previous historical video frame, so even if the historical mask image is unreliable, it does not affect the object segmentation result of the first video frame. In other words, an unreliable historical mask image cannot make the segmentation result of the first video frame inaccurate, which guarantees the relative accuracy and reliability of the calibration mask image obtained in the second segmentation mode.
It should be noted that the present application does not detail the implementation processes by which the first segmentation mode and the second segmentation mode respectively perform object segmentation; the segmentation algorithm underlying each implementation may be determined as the case may be.
In step S14, in the case where the comparison result of the calibration mask image and the first mask image satisfies the video segmentation calibration condition, the calibration mask image is output as the target mask image of the first video frame.
As described above, in order to prevent an unreliable and inaccurate mask image from guiding the object segmentation of the next video frame and adversely affecting its segmentation result, after the two mask images of the first video frame, namely the first mask image and the calibration mask image, are obtained in the different segmentation modes described above, the two images may be compared to determine whether the first mask image is accurate and, further, which image to output as the target mask image of the first video frame.
Specifically, the present application may pre-configure, according to actual requirements, a calibration condition for deciding that the mask image of the current video frame is inaccurate and that letting it guide the object segmentation of subsequent video frames would make the segmentation effect increasingly poor. In this way, when the comparison result between the calibration mask image and the first mask image satisfies this video segmentation calibration condition, the first mask image may be considered inaccurate and liable to degrade the object segmentation of future video frames if used for guidance.
It can be understood that, after the target mask image of the first video frame is determined in the above manner, the target mask image may be stored in association with a frame identifier (such as a frame number) of the first video frame, and the target mask image may be used as a historical mask image of an adjacent historical video frame of a next video frame to implement object segmentation processing on the next video frame.
In some embodiments, the object segmentation process only needs the guidance of the historical mask image of the immediately previous video frame; mask images from longer ago are not needed, nor are multiple historical frame mask images needed subsequently. To save resources, therefore, only the target mask image of the current video frame may be retained to guide the object segmentation of future video frames, or only the mask images of a few frames close to the current frame may be kept and updated. Of course, in still other embodiments, if the application scene requires the mask images corresponding to multiple consecutive video frames, those mask images need to be retained for later use. The storage scheme for the consecutive mask images of a whole video can therefore be determined as the case may be.
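The sketch below summarizes the per-frame flow of steps S11 to S14 under the resource-saving scheme above, retaining only the newest target mask as guidance for the next frame. The callables standing in for the first segmentation mode, the second segmentation mode and the calibration condition are hypothetical placeholders, not names from the patent.

```python
from typing import Callable, Iterable, Iterator, Optional
import numpy as np

def run_segmentation(video_frames: Iterable[np.ndarray],
                     segment_guided: Callable,      # first segmentation mode (frame, prev mask)
                     segment_standalone: Callable,  # second segmentation mode (frame only)
                     needs_calibration: Callable) -> Iterator[np.ndarray]:
    """Yield one target mask per frame, keeping only the newest mask as history."""
    prev_mask: Optional[np.ndarray] = None
    for frame in video_frames:                          # S11: current frame + history mask
        first_mask = segment_guided(frame, prev_mask)   # S12: guided by the history mask
        calib_mask = segment_standalone(frame)          # S13: independent of the history mask
        # S14: output the calibration mask when the comparison meets the condition
        target_mask = calib_mask if needs_calibration(first_mask, calib_mask) else first_mask
        yield target_mask
        prev_mask = target_mask                         # only the latest mask is retained
```

Any concrete comparison rule, such as the region-area check described in a later embodiment, can be plugged in as needs_calibration.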
In summary, in the embodiment of the present application, after the first video frame and the historical mask image of the adjacent historical video frame are obtained, object segmentation processing is performed in the first segmentation mode using the historical mask image and the first video frame to obtain the first mask image of the first video frame. To prevent an inaccurate and unreliable first mask image from degrading the object segmentation effect of the next video frame under its guidance, object segmentation processing is also performed in the second segmentation mode using only the first video frame to obtain the calibration mask image of the first video frame, so that the acquisition of the calibration mask image is independent of the historical mask image of the previous video frame and cannot be adversely affected by it. The calibration mask image is then compared with the first mask image; when the comparison result satisfies the video segmentation calibration condition, the first mask image is considered inaccurate, the object segmentation of the next video frame is no longer guided by it, and the calibration mask image is output directly as the target mask image of the first video frame. This guarantees the object segmentation effect of the next video frame guided by the target mask image, and solves the technical problem that, when video object segmentation relies only on the first segmentation mode while the video object moves rapidly or is occluded, the target object cannot be reliably identified, the object segmentation effect is poor, and subsequent applications of the segmentation result cannot be satisfied.
Referring to fig. 2, a schematic flowchart of another optional example of the video object segmentation method proposed in the present application, the present embodiment may be an optional detailed implementation of the video object segmentation method described in the foregoing embodiment, specifically, a detailed description of the first segmentation method and the second segmentation method, but is not limited to the detailed implementation method described in the present embodiment. As shown in fig. 2, the video object segmentation method proposed in this embodiment may include:
step S21, acquiring a first video frame and a historical mask image of a historical video frame adjacent to the first video frame;
for a specific implementation process of step S21, reference may be made to the description of the corresponding parts in the foregoing embodiments, which are not described herein again.
Step S22, inputting the historical mask image and the first video frame into a first segmentation model, and outputting a first mask image of the first video frame;
step S23, inputting the first video frame into the second segmentation model, and outputting the calibration mask image of the first video frame;
in combination with the above description about the first segmentation mode and the second segmentation mode, the present application may perform model training by using a suitable deep learning/machine learning algorithm to obtain a segmentation model representing an object segmentation rule of the corresponding segmentation mode, so as to implement object segmentation processing on each video frame and output a corresponding segmentation result, such as a mask image of a foreground object of the corresponding video frame.
It can be understood that, because the first segmentation mode and the second segmentation mode rely on different information when performing object segmentation on a video frame, the sample data input during training of the corresponding segmentation models also differs. For convenience of description, the segmentation model corresponding to the first segmentation mode is referred to as the first segmentation model, and the segmentation model corresponding to the second segmentation mode is referred to as the second segmentation model.
Specifically, in the training process of the first segmentation model, the model inputs are a plurality of sample video frames and their sample mask images: the sample mask image of the previous sample video frame and the current sample video frame are input into the initial segmentation model for training and learning until a training termination condition is met, for example a preset number of training iterations is reached or the accuracy of the mask images output by the model meets a preset requirement, and the finally trained model is recorded as the first segmentation model. As can be seen, the input of the first segmentation model is a four-channel input consisting of the mask of the previous video frame plus the RGB of the current video frame.
The initial segmentation model may be a machine learning/deep learning network such as an initial neural network, or a standard segmentation model in the video field; the required first segmentation model can then be obtained after several rounds of training and adjustment.
Similarly, the training process of the second segmentation model is analogous; the difference is that when training the second segmentation model, only the sample video frames are input into the initial segmentation model for training and learning until the corresponding training termination condition is met, yielding the required second segmentation model. As can be seen, the input of the second segmentation model is a three-channel RGB input. It should be noted that the present application does not describe the specific training processes of the first segmentation model and the second segmentation model in detail, and they are not limited to the training manner described above.
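A hedged PyTorch sketch of the two input layouts just described, assuming illustrative tensor shapes: the first segmentation model consumes a four-channel tensor (previous-frame mask plus RGB), while the second consumes plain three-channel RGB.

```python
import torch

h, w = 480, 640
rgb = torch.rand(1, 3, h, w)         # RGB channels of the current video frame
prev_mask = torch.rand(1, 1, h, w)   # mask output for the previous video frame

first_model_input = torch.cat([prev_mask, rgb], dim=1)   # mask + RGB: four channels
second_model_input = rgb                                 # RGB only: three channels
assert first_model_input.shape == (1, 4, h, w)
assert second_model_input.shape == (1, 3, h, w)
```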
Based on the above description of the first segmentation model and the second segmentation model, after obtaining the first video frame, the computer device may query the historical mask image of the previous historical video frame adjacent to the first video frame, and then perform the object segmentation processing on the first video frame according to the above manners of step S22 and step S23, respectively, to obtain two mask images of the first video frame, that is, the first mask image and the calibration mask image.
It can be understood that, when the video object is not moving rapidly, occluded, rotating, or the like, performing object segmentation on the video frame is similar to segmenting a static image, and the first mask image output by the first segmentation model may achieve a better object segmentation effect and identify the foreground object more accurately than the calibration mask image output by the second segmentation model.
However, when the problems listed here occur, the relative position of the foreground object may change greatly between adjacent video frames, so that the foreground object identified in the historical mask image is not accurate, the segmentation effect of the first mask image obtained under its guidance may be unsatisfactory, and the object segmentation effect of the next video frame would deteriorate further. To avoid this situation, the present application proposes to adopt the manner of step S23 to obtain the calibration mask image used to determine whether the segmentation result of the first mask image is accurate, although obtaining the calibration mask image is not limited to the implementation of step S23.
In step S24, in the case where the comparison result of the calibration mask image and the first mask image satisfies the video segmentation calibration condition, the calibration mask image is output as the target mask image of the first video frame.
For a specific implementation process of step S24, reference may be made to the description of the corresponding parts in the above embodiments, which is not described herein again.
In summary, in the embodiment of the present application, the obtained first video frame is processed with different segmentation models: the historical mask image and the first video frame are input into the first segmentation model to obtain the first mask image of the foreground object in the first video frame, and the first video frame alone is input into the second segmentation model to obtain another mask image of the foreground object, recorded as the calibration mask image. The two mask images of the foreground object of the first video frame are then compared; if the comparison result satisfies the video segmentation calibration condition, the foreground object segmentation effect of the first mask image obtained by the first segmentation model is not good, and the calibration mask image is output instead.
Referring to fig. 3, a schematic flowchart of a further optional example of the video object segmentation method proposed in the present application, this embodiment may be a further optional detailed implementation of the video object segmentation method described in the foregoing embodiment, specifically, how to implement the comparison between the calibration mask image and the first mask image and how to determine whether the obtained comparison result satisfies the video segmentation calibration condition in the foregoing embodiment, but is not limited to such a detailed implementation described in this embodiment. As shown in fig. 3, the video object segmentation method proposed in this embodiment may include:
step S31, acquiring a first video frame and a historical mask image of a historical video frame adjacent to the first video frame;
step S32, inputting the historical mask image and the first video frame into a first segmentation model, and outputting a first mask image of the first video frame;
for specific implementation processes of step S31 and step S32, reference may be made to the description of corresponding parts in the foregoing embodiments, which are not described herein again.
Step S33, setting the historical mask image to zero to obtain the target historical mask image;
step S34, inputting the target historical mask image and the first video frame into a third segmentation model, and outputting a calibration mask image of the first video frame;
Based on the above description of the second segmentation mode, the present application expects the object segmentation of the first video frame in this mode to be unaffected by the historical mask image of the previous historical video frame. Therefore, for a segmentation model with four input channels, the embodiment of the present application proposes to set the historical mask image to zero, that is, the input of the mask channel is zero. The zeroed mask and the RGB three-channel data of the first video frame are input into the third segmentation model, and each network layer of the third segmentation model then effectively analyzes only the RGB data of the first video frame, which ensures that the output calibration mask image of the first video frame is not adversely affected by the historical mask image.
As can be seen, the inputs of the third segmentation model and the first segmentation model in this embodiment are both four-channel inputs; the difference lies in the content of the input channel where the mask is located. Therefore, in practical applications, the third segmentation model may be the same as the first segmentation model described above. Of course, the third segmentation model can also be retrained so that it is better suited to segmenting video frames accurately when no mask data is available, in which case it differs from the first segmentation model.
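A minimal sketch of step S33, assuming a PyTorch four-channel layout as above: the mask channel is zeroed so that the third segmentation model effectively sees only the RGB data of the first video frame. Names and shapes are illustrative assumptions.

```python
import torch

def build_calibration_input(rgb: torch.Tensor,        # 1 x 3 x H x W, first video frame
                            hist_mask: torch.Tensor   # 1 x 1 x H x W, history mask
                            ) -> torch.Tensor:
    target_hist_mask = torch.zeros_like(hist_mask)    # S33: set the history mask to zero
    # S34: four-channel input whose mask channel carries no history information
    return torch.cat([target_hist_mask, rgb], dim=1)  # 1 x 4 x H x W, fed to the third model
```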
Step S35, acquiring first attribute information of the first mask image and second attribute information of the calibration mask image;
step S36, comparing the first attribute information and the second attribute information to obtain an attribute difference between object segmentation regions of the first mask image and the calibration mask image;
This embodiment of the application provides an optional implementation of comparing the first mask image with the calibration mask image, namely comparing the respective attribute information of the two mask images. Since what a mask image identifies is actually the region where the foreground object is located, the attribute information of a mask image can be regarded as attribute information of the foreground object of the first video frame.
In general, when the video object is not moving rapidly, occluded, or the like, the difference in image content between adjacent video frames is very small, and the difference between the two mask images obtained in the different segmentation modes is also usually small, possibly even zero. However, if these problems occur, the foreground object identified by the historical mask image may be inaccurate, making the foreground object identification of the first mask image even worse, while the identification accuracy of the calibration mask image is unaffected.
For example, the attribute information of a mask image may include the shape, area, and the like of the region where the identified target object (such as the foreground object) is located. Thus, when comparing the respective target object regions of the two mask images, the shapes and/or areas of the identified target object regions may be compared to determine the difference in object segmentation effect between the two mask images, although the attribute information is not limited to the contents listed in this embodiment.
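As an illustration, the sketch below extracts two such attributes from a binary mask: the region area and a bounding box as a simple shape descriptor. The bounding box is one illustrative choice of shape attribute, not mandated by the patent.

```python
import numpy as np

def mask_attributes(mask: np.ndarray) -> dict:
    """Return the area (pixel count) and bounding box of the object region of a
    binary H x W mask; an empty mask yields area 0 and no bounding box."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return {"area": 0, "bbox": None}
    bbox = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
    return {"area": int(xs.size), "bbox": bbox}
```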
Step S37, detecting whether the attribute difference reaches the video segmentation calibration threshold value, if yes, entering step S38; if not, go to step S39;
step S38, outputting the calibration mask image as the target mask image of the first video frame;
in step S39, the first mask image is output as the target mask image of the first video frame.
It is understood that the content of the video segmentation calibration threshold may be different for different content of attribute information, and embodiments of the present application are not listed here.
In summary, after object segmentation is performed on the first video frame in different segmentation modes to obtain the two mask images, their attribute information is compared. If the resulting attribute difference reaches the video segmentation calibration threshold, it is determined that the segmentation result output by the first segmentation model is inaccurate, and the segmentation result output by the third segmentation model, namely the calibration mask image, is taken as the object segmentation result (target mask image) of the first video frame. Conversely, if the attribute difference does not reach the video segmentation calibration threshold, the segmentation result output by the first segmentation model is determined to be accurate, and the output first mask image is still used as the target mask image of the first video frame. This ensures the accuracy of the object segmentation result of each video frame and better serves the subsequent applications that rely on the segmentation results.
It can be understood that the implementation of steps S35 to S39 in this embodiment may serve as an optional detailed implementation of step S14 and step S24; that is, substituting it for step S14 and step S24 yields further embodiments of the video object segmentation method, whose specific implementation process is not repeated here.
Referring to fig. 4, which is a schematic flowchart of yet another optional example of the video object segmentation method proposed in the present application, this embodiment may be a further detailed implementation method that implements comparison between a calibration mask image and a first mask image in the video object segmentation method described in the foregoing embodiment to determine whether a result of the comparison satisfies a video segmentation calibration condition, as shown in fig. 4, the method may include:
step S41, acquiring a first video frame and a historical mask image of a historical video frame adjacent to the first video frame;
step S42, according to a first segmentation mode, carrying out object segmentation processing by using the historical mask image and the first video frame to obtain a first mask image of the first video frame;
step S43, according to a second segmentation mode, carrying out object segmentation processing by using the first video frame to obtain a calibration mask image of the first video frame;
for the specific implementation process of step S41 to step S43, reference may be made to the description of the corresponding parts in the foregoing embodiments, which are not described herein again.
Step S44, performing pixel statistics on the object segmentation regions included in the first mask image and the calibration mask image respectively to obtain a first region area of the object segmentation region in the first mask image and a second region area of the object segmentation region in the calibration mask image;
step S45, performing a difference operation on the first region area and the second region area to obtain a region area difference between the respective object segmentation regions of the first mask image and the calibration mask image;
in step S46, if the area difference is greater than the video segmentation calibration threshold, the calibration mask image is output as the target mask image of the first video frame.
As can be seen, this embodiment of the present application takes the area of the identified target object region as the attribute information of the mask image. Since the area of an image region can be represented by the number of pixel points it contains, this embodiment may perform pixel statistics on the object segmentation regions contained in the first mask image and the calibration mask image (such as the foreground object region in the first video frame, specifically the portrait region of a conference member in a video conference), thereby obtaining the region areas of the object segmentation regions identified by the first mask image and the calibration mask image.
Comparing the two region areas obtained by statistics actually compares the numbers of pixel points contained in the object segmentation regions obtained in the two segmentation modes. Because the integrity of the identified segmentation object must be ensured, the first mask image usually contains a complete target object; in some abnormal situations, however, the first mask image may also take in unwanted objects, so that the object segmentation region of the first mask image becomes larger than the target object region.
Based on this, in the embodiment of the present application, the region area difference may be obtained by subtracting the second region area from the first region area, and it is determined whether this difference is greater than the video segmentation calibration threshold, that is, whether the first region area exceeds the second region area by more than the video segmentation calibration threshold. If so, it may be determined that the first mask image contains non-target objects and its object segmentation result is inaccurate, and the calibration mask image is output as the target mask image of the first video frame. Otherwise, the object segmentation result of the first mask image may be considered accurate, and the first mask image is output as the target mask image of the first video frame, ensuring the accuracy of the mask image output for each video frame, that is, the accuracy of the object segmentation result of each video frame.
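The sketch below illustrates steps S44 to S46 under these assumptions: count the pixels of each object segmentation region, subtract, and compare against a video segmentation calibration threshold whose value here is purely illustrative.

```python
import numpy as np

CALIBRATION_THRESHOLD = 5000  # pixels; the actual threshold is application-specific

def choose_target_mask(first_mask: np.ndarray,
                       calib_mask: np.ndarray) -> np.ndarray:
    first_area = np.count_nonzero(first_mask)   # S44: pixel statistics per mask
    calib_area = np.count_nonzero(calib_mask)
    area_diff = first_area - calib_area          # S45: difference operation
    if area_diff > CALIBRATION_THRESHOLD:        # S46: first mask took in non-target content
        return calib_mask                        # output the calibration mask
    return first_mask                            # first mask is considered accurate
```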
Based on the video object segmentation method described in the foregoing embodiments, referring to fig. 5, which is a flowchart of yet another optional example of the video object segmentation method provided in the present application, this embodiment may further limit and explain when to verify the reliability or stability of the first mask image of the first video frame obtained by the first segmentation method. As shown in fig. 5, the method may include:
step S51, acquiring a first video frame and a historical mask image of a historical video frame adjacent to the first video frame;
step S52, according to a first segmentation mode, carrying out object segmentation processing by using the historical mask image and the first video frame to obtain a first mask image of the first video frame;
for specific implementation processes of step S51 and step S52, reference may be made to the description of corresponding parts in the foregoing embodiments, which are not described herein again.
Step S53, acquiring the segmentation state information of the first video frame;
step S54, detecting whether the segmentation state information meets the video calibration condition, if yes, entering step S55, if no, executing step S59;
step S55, according to a second segmentation mode, carrying out object segmentation processing by using the first video frame to obtain a calibration mask image of the first video frame;
in the embodiment of the present application, in order to ensure the accuracy of the object segmentation result of each video frame, each video frame may be processed according to two segmentation methods in the manner described in the above embodiment, so as to determine the target mask image of the video frame.
However, performing this calibration verification on every frame in real time would consume considerable resources of the computer device. To reduce resource occupation, and considering that the object segmentation result obtained in the first segmentation mode is not inaccurate for every video frame, the embodiment of the present application sets a video calibration condition, that is, a condition for deciding that the object segmentation result (i.e., the first mask image) obtained for the current video frame in the first segmentation mode needs calibration verification. The content of this condition is not limited by the present application.
Thus, only when the video calibration condition is satisfied is the other segmentation mode adopted to segment the current video frame, with its object segmentation result used for calibration verification. If the video calibration condition is not satisfied, the target mask image of the first video frame is obtained only by the processing of step S52, and the other segmentation mode need not be executed, saving the computer device resources its execution would occupy and improving the efficiency of video frame object segmentation.
Based on the above analysis, after the first video frame is obtained, the segmentation status information of the first video frame may be directly obtained, or after the first mask image is obtained, the segmentation status information of the first video frame may be obtained, and the execution sequence of the step S53 and the step S52 is not limited in this application. The content of the segmentation status information can be determined according to the content of the video calibration condition, which is not described in detail in this application.
In one possible implementation, step S55 may be performed at certain time intervals to verify the first mask image obtained in step S52. In this case, the segmentation state information may be the acquisition time point of the first video frame, and the video calibration condition may require that the calibration time interval between the last video calibration time point and the current time reach a preset video calibration time interval threshold.
Based on this, the detection process of step S54 may include: acquiring the calibration time interval corresponding to the first video frame, namely the time interval between the acquisition time point of the first video frame and the last video calibration time point, and then determining whether the calibration time interval reaches the video calibration time interval threshold. If it does, the segmentation state information of the first video frame can be considered to meet the video calibration condition; if not, it is considered not to satisfy the video calibration condition.
In yet another possible implementation, unlike the timing manner above (i.e., measuring the calibration time interval), this embodiment may instead count the frame number difference between the video frame on which the second segmentation mode was last executed and the first video frame (frame numbers may be assigned sequentially over the total video frames contained in the entire video), and thereby decide whether to execute the second segmentation mode for the first video frame. In this case, the segmentation state information may be the frame number of the first video frame, and the video calibration condition may include: the frame number interval between the video frame on which the second segmentation mode was last executed (i.e., the video frame of the last video segmentation calibration) and the first video frame reaches a video calibration frame number interval threshold, whose specific value is not limited in the present application.
Based on this, the detection process of step S54 may include: acquiring the frame number interval between the first video frame and the video frame of the last video calibration, and determining whether the frame number interval reaches the video calibration frame number interval threshold; if so, the segmentation state information of the first video frame can be considered to meet the video calibration condition; if not, it can be considered not to satisfy the video calibration condition.
It should be noted that the content of the segmentation status information of the video frame and the content of the video calibration condition corresponding to the segmentation status information are not limited to the two implementations listed above.
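For illustration only, the two detection implementations above can be sketched in Python as follows; the threshold values, function name, and argument layout are assumptions introduced here, not part of the disclosed method.

```python
# Hypothetical thresholds; the application does not fix concrete values.
CALIB_TIME_INTERVAL_S = 0.5   # video calibration time interval threshold (assumed)
CALIB_FRAME_INTERVAL = 15     # video calibration frame number interval threshold (assumed)

def needs_calibration(frame_time, frame_index,
                      last_calib_time, last_calib_index, by_time=True):
    """Step S54: detect whether the segmentation state information of the
    current frame satisfies the video calibration condition."""
    if by_time:
        # Time-based condition: the interval between the frame's acquisition
        # time point and the last video calibration time point.
        return frame_time - last_calib_time >= CALIB_TIME_INTERVAL_S
    # Frame-number-based condition: the frame number interval since the
    # video frame of the last video calibration.
    return frame_index - last_calib_index >= CALIB_FRAME_INTERVAL
```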
Step S56, comparing the calibration mask image with the first mask image to obtain a comparison result;
step S57, detecting whether the comparison result meets the video segmentation calibration condition, if yes, entering step S58, if no, executing step S59;
step S58, outputting the calibration mask image as the target mask image of the first video frame;
step S59, outputting the first mask image as the target mask image of the first video frame.
For the specific implementation process of step S56 to step S59, reference may be made to, but not limited to, the description of the corresponding parts in the foregoing embodiments, and this embodiment is not described herein again.
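As a minimal sketch of the comparison in steps S56 and S57, assuming (as in the region-area implementation described below) that the compared attribute is the area of the object segmentation region and that the masks are binary NumPy arrays; the threshold value is an assumption:

```python
import numpy as np

SEG_CALIB_AREA_THRESHOLD = 2000  # video segmentation calibration threshold, in pixels (assumed)

def compare_masks(first_mask: np.ndarray, calib_mask: np.ndarray) -> bool:
    """Steps S56/S57: pixel statistics over the object segmentation regions of
    both masks, then a difference operation on the two region areas; returns
    True when the area difference reaches the calibration threshold."""
    first_area = int(np.count_nonzero(first_mask))   # first region area
    calib_area = int(np.count_nonzero(calib_mask))   # second region area
    return abs(first_area - calib_area) >= SEG_CALIB_AREA_THRESHOLD
```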
In still other embodiments, when the comparison result satisfies the video segmentation calibration condition, the application may further adjust the model parameters of the first segmentation model according to the comparison result between the calibration mask image and the first mask image and the data of the first video frame (refer to the model training process described above), so as to improve the accuracy of the object segmentation results output by the model, and then continue the object segmentation processing on the next video frame with the adjusted first segmentation model. As can be seen, when the comparison result satisfies the video segmentation calibration condition, the processing is not limited to discarding the first mask image and outputting the calibration mask image as the target mask image of the first video frame.
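One way such an online adjustment could look, as a hedged sketch: a PyTorch training step that treats the calibration mask image as the supervision target for the first video frame. The two-input model signature, the optimizer, and the loss function are all assumptions; the application only refers back to its earlier model training process.

```python
import torch
import torch.nn.functional as F

def adjust_first_model(first_model, optimizer, frame, prev_mask, calib_mask):
    """Adjust the first segmentation model so that its prediction for the
    first video frame moves toward the calibration mask image."""
    first_model.train()
    optimizer.zero_grad()
    logits = first_model(frame, prev_mask)          # re-predict the first mask
    loss = F.binary_cross_entropy_with_logits(logits, calib_mask.float())
    loss.backward()
    optimizer.step()    # the adjusted model then segments the next video frame
    first_model.eval()
    return loss.item()
```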
In summary, while the computer device performs object segmentation processing on the historical mask image and the first video frame according to the first segmentation mode to obtain the first mask image, if it detects that the segmentation state information of the first video frame meets the video calibration condition, it performs object segmentation processing on the first video frame again according to the second segmentation mode to obtain a calibration mask image, so as to determine whether the object segmentation result of the first mask image is accurate. If it detects that the segmentation state information does not meet the video calibration condition, it directly outputs the first mask image, avoiding the unnecessary consumption of computer device resources that re-segmenting an already accurate first video frame according to the second segmentation mode would cause.
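Putting the pieces together, the fig. 5 flow (steps S51 to S59) might be orchestrated as below, reusing the needs_calibration and compare_masks sketches above; the callables for the two segmentation modes and the state dict are assumptions.

```python
def segment_frame(frame, frame_time, frame_index, prev_mask, state,
                  run_first_mode, run_second_mode):
    """One pass of the fig. 5 flow for a single video frame."""
    first_mask = run_first_mode(frame, prev_mask)             # step S52
    if needs_calibration(frame_time, frame_index,             # steps S53/S54
                         state["last_calib_time"],
                         state["last_calib_index"]):
        calib_mask = run_second_mode(frame)                   # step S55
        state["last_calib_time"] = frame_time
        state["last_calib_index"] = frame_index
        if compare_masks(first_mask, calib_mask):             # steps S56/S57
            return calib_mask                                 # step S58
    return first_mask                                         # step S59
```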
Taking a video conference scene as an example of the video object segmentation method described in the foregoing embodiments: the method may be performed on each frame of image acquired by any terminal device participating in the video, so as to accurately identify the portrait area in that frame. When faced with problems such as rapid movement of the user, the object segmentation result obtained by the first segmentation mode may include, besides the user's portrait area, other objects in the user's environment, such as a table and a chair; such an object segmentation result is inaccurate and cannot meet subsequent application requirements. The method above can identify such inaccurate object segmentation results and output the relatively accurate object segmentation result obtained by the second segmentation mode as the target object segmentation result of that frame of image.
Based on the obtained target object segmentation result, if a user does not want the other participating users to see certain contents of the image, such as the environment the user occupies, the identified portrait area can be superimposed on another selected background image, so that the user appears to participate in the video conference in the environment shown by the background image. Of course, occlusion of the content of a specific area can also be realized according to the object segmentation result, and so on, as determined by the requirements of the specific application scene; the processing performed after the object segmentation result is obtained is not detailed in this application.
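A background replacement of this kind reduces to masked compositing; a minimal sketch, assuming the target mask is a binary HxW array and the frame and background are HxWx3 arrays of the same size:

```python
import numpy as np

def replace_background(frame: np.ndarray, target_mask: np.ndarray,
                       background: np.ndarray) -> np.ndarray:
    """Superimpose the identified portrait area onto a selected background
    image using the target mask image of the frame."""
    portrait = target_mask.astype(bool)[..., None]   # broadcast over channels
    return np.where(portrait, frame, background)     # keep portrait, swap the rest
```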
Referring to fig. 6, a schematic structural diagram of an optional example of the video object segmentation apparatus proposed in the present application, which may be applied to the electronic device described above. As shown in fig. 6, the apparatus may include:
an obtaining module 61, configured to obtain a first video frame and a historical mask image of a historical video frame adjacent to the first video frame;
a first segmentation processing module 62, configured to perform object segmentation processing on the historical mask image and the first video frame according to a first segmentation mode to obtain a first mask image of the first video frame;
a second segmentation processing module 63, configured to perform object segmentation processing on the first video frame according to a second segmentation method to obtain a calibration mask image of the first video frame;
a target mask image output module 64, configured to output the calibration mask image as a target mask image of the first video frame when the comparison result of the calibration mask image and the first mask image satisfies the video segmentation calibration condition.
In some embodiments, the first segmentation processing module 62 may specifically include:
the first model processing unit is used for inputting the historical mask image and the first video frame into a first segmentation model and outputting a first mask image of the first video frame;
the second division processing module 63 may include:
the second model processing unit is used for inputting the first video frame into a second segmentation model and outputting a calibration mask image of the first video frame;
or, the third model processing unit is configured to set the historical mask image to zero to obtain a target historical mask image, input the target historical mask image and the first video frame into a third segmentation model, and output a calibration mask image of the first video frame.
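The three model processing units above could be sketched as thin wrappers around the respective segmentation models, for example (PyTorch-style tensors and two-input model signatures are assumptions):

```python
import torch

def run_first_mode(first_model, frame, prev_mask):
    """First model processing unit: input the historical mask image and the
    first video frame into the first segmentation model."""
    with torch.no_grad():
        return first_model(frame, prev_mask)

def run_second_mode(second_model, frame):
    """Second model processing unit: input only the first video frame into
    the second segmentation model."""
    with torch.no_grad():
        return second_model(frame)

def run_third_mode(third_model, frame, prev_mask):
    """Third model processing unit: set the historical mask image to zero to
    obtain the target historical mask image, then input it into the third
    segmentation model together with the frame."""
    target_hist_mask = torch.zeros_like(prev_mask)   # zeroed historical mask
    with torch.no_grad():
        return third_model(frame, target_hist_mask)
```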
Based on the above description, in further embodiments, the video object segmentation apparatus proposed in the present application may further include:
and the model adjusting module is used for adjusting the model parameters of the first segmentation model according to the comparison result of the calibration mask image and the first mask image and the data of the first video frame, under the condition that the comparison result meets the video segmentation calibration condition, so as to continue to perform object segmentation processing on the next video frame by using the adjusted first segmentation model.
In some embodiments, the target mask image output module 64 may include:
an attribute information acquisition unit configured to acquire first attribute information of the first mask image and second attribute information of the calibration mask image;
an attribute comparison unit, configured to compare the first attribute information and the second attribute information to obtain an attribute difference between object segmentation regions of the first mask image and the calibration mask image;
a detection unit for detecting whether the attribute difference reaches a video segmentation calibration threshold;
a first output unit configured to output the calibration mask image as a target mask image of the first video frame if a detection result of the detection unit is yes;
and the second output unit is used for outputting the first mask image as the target mask image of the first video frame under the condition that the detection result of the detection unit is negative.
In a possible implementation manner, the attribute information obtaining unit may include:
the pixel counting unit is used for respectively carrying out pixel counting on video object segmentation areas contained in the first mask image and the calibration mask image to obtain a first area of the object segmentation area in the first mask image and a second area of the object segmentation area in the calibration mask image;
accordingly, the attribute comparison unit may include:
and the area difference acquiring unit is used for performing difference operation on the first area and the second area to obtain the area difference between the object segmentation areas of the first mask image and the calibration mask image.
In still other embodiments, the video object segmentation apparatus provided in the present application may further include:
the segmentation state information acquisition module is used for acquiring the segmentation state information of the first video frame;
and the detection module is configured to detect whether the segmentation state information satisfies a video calibration condition, and if so, trigger the second segmentation processing module 63 to perform object segmentation processing by using the first video frame according to a second segmentation method to obtain a calibration mask image of the first video frame.
And the second output module is used for outputting the first mask image as the target mask image of the first video frame under the condition that the detection result of the detection module is negative.
Optionally, the detection module may include:
a time interval obtaining unit, configured to obtain a calibration time interval corresponding to the first video frame; wherein, the calibration time interval refers to the time interval between the acquisition time point of the first video frame and the last video calibration time point;
a first detection unit, configured to detect whether the calibration time interval reaches a video calibration time interval threshold;
in still other embodiments, the detection module may further include:
a frame interval acquiring unit, configured to acquire a frame interval between the first video frame and a video frame of a previous video calibration;
and the second detection unit is used for detecting whether the frame interval reaches a video calibration frame interval threshold value.
It should be noted that, various modules, units, and the like in the embodiments of the foregoing apparatuses may be stored in the memory as program modules, and the processor executes the program modules stored in the memory to implement corresponding functions, and for the functions implemented by the program modules and their combinations and the achieved technical effects, reference may be made to the description of corresponding parts in the embodiments of the foregoing methods, which is not described in detail in this embodiment.
The present application also provides a storage medium on which a computer program may be stored, where the computer program may be called and loaded by a processor to implement the steps of the video object segmentation method described in the above embodiments.
Referring to fig. 7, a schematic diagram of the hardware structure of an optional example of a computer device suitable for the video object segmentation method and apparatus provided in the present application; the computer device may be a terminal device or a service device, as determined by the requirements of the specific application scenario. As shown in fig. 7, the computer device may include: at least one memory 71 and at least one processor 72, wherein:
the memory 71 may be used to store a program for implementing the video object segmentation method described in the above method embodiments; the processor 72 may be configured to load and execute the program stored in the memory 71 to implement each step of the video object segmentation method described in the foregoing corresponding method embodiment, and a specific implementation process may refer to the description of the corresponding part in the foregoing embodiment, which is not described herein again.
In combination with the above description of the computer device, referring to fig. 8, a schematic view of a video conference scene to which the video object segmentation method and apparatus provided by the present application are applicable: when the computer device is a terminal device, each time the terminal device collects a frame of image, it can determine the target mask image of that frame according to the video object segmentation method provided by the present application and accurately extract the target object region in the frame, such as a portrait region, for application requirements such as tracking and identification of a target object or the shielding of specific region information.
The terminal device acquires a plurality of continuous video frames and, according to the video object segmentation method provided by the present application, obtains the target mask image of each video frame; these can then be sent to a communication server supporting the normal operation of the video conference. The communication server processes the target mask images according to the application requirements and either feeds the processed video streams directly back to each terminal device participating in the video conference for output or sends them to the other terminal devices participating in the video conference for output, so that this terminal device and the other terminal devices output different video conference interfaces. Of course, the video stream may also be processed directly by the terminal device and then forwarded, through the communication server, to the other terminal devices for output.
It is understood that, in the case where the computer device is a terminal device, it may include, but is not limited to, a smart phone, a tablet computer, a wearable device, a personal computer (PC), a netbook, an augmented reality (AR) device, a virtual reality (VR) device, an in-vehicle device, a robot, a desktop computer, and the like. The terminal devices included in the scenario shown in fig. 8 are only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
Each of the terminal devices listed above may also include at least one input device, such as a touch sensing unit that senses touch events on a touch display panel, a keyboard, a mouse, a camera, or a microphone; at least one output device, such as a display, a speaker, a vibration mechanism, or a light; various sensor components; a power management module, an antenna, and the like, all of which may be determined according to the product type and functional requirements of the terminal device and are not listed one by one here.
In some embodiments, if the computer device is a service device, then after the terminal device collects each frame of image, it may send the frame to the service device for the video object segmentation processing that yields the target mask image of each required video frame. The user may then send a corresponding processing request to the service device according to the display requirements for the video content, such as the display requirements for a person's window in the video conference interface, so that the service device, in response to the processing request, obtains the video stream data about the user to be sent to the other terminal devices participating in the video conference and sends that video stream data to those terminal devices for output.
Optionally, after obtaining the target mask image of each video frame, the service device may instead feed it back to the terminal device, so that the user uses the target mask image on the terminal device to process the image content of each video frame to meet the display requirements for the output video content, such as the requirement that this terminal device and the other terminal devices output different video conference interfaces; the processed video stream is then sent to the other terminal devices for output.
In summary, the present application does not limit the product type of the computer device, which can be determined according to the requirements of the specific scene. Moreover, the video object segmentation method and apparatus provided in the present application are suitable not only for video conference scenes but also for other video scenes, to meet the requirements of those scenes; these are not detailed here.
It should also be understood that the structure of the computer device shown in fig. 7 does not constitute a limitation on the computer device in the embodiments of the present application; in practical applications, the computer device may include more or fewer components than those shown in fig. 7, or some components may be combined, which is not specifically described in this application.
Finally, it should be noted that, in the present specification, the embodiments are described in a progressive or parallel manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. The device, the computer equipment and the system structure in the application scene disclosed by the embodiment correspond to the method disclosed by the embodiment, so that the description is relatively simple, and the relevant points can be obtained by referring to the description of the method part.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A video object segmentation method, the method comprising:
acquiring a first video frame and a historical mask image of a historical video frame adjacent to the first video frame;
according to a first segmentation mode, performing object segmentation processing by using the historical mask image and the first video frame to obtain a first mask image of the first video frame;
according to a second segmentation mode, carrying out object segmentation processing by using the first video frame to obtain a calibration mask image of the first video frame;
and under the condition that the comparison result of the calibration mask image and the first mask image meets the video segmentation calibration condition, outputting the calibration mask image as a target mask image of the first video frame.
2. The method of claim 1, the comparing of the first mask image to the calibration mask image comprising:
acquiring first attribute information of the first mask image and second attribute information of the calibration mask image;
comparing the first attribute information with the second attribute information to obtain attribute differences between respective object segmentation areas of the first mask image and the calibration mask image;
the comparison of the calibration mask image and the first mask image satisfies a video segmentation calibration condition, including:
the attribute difference reaches a video segmentation calibration threshold.
3. The method of claim 2, the obtaining first attribute information for the first mask image and second attribute information for the calibration mask image, comprising:
respectively carrying out pixel statistics on object segmentation regions respectively contained in the first mask image and the calibration mask image to obtain a first region area of the object segmentation region in the first mask image and a second region area of the object segmentation region in the calibration mask image;
the comparing the first attribute information and the second attribute information to obtain an attribute difference between object segmentation regions of the first mask image and the calibration mask image, includes:
and performing difference operation on the first region area and the second region area to obtain a region area difference between the respective object segmentation regions of the first mask image and the calibration mask image.
4. The method of any of claims 1-3, further comprising:
acquiring segmentation state information of the first video frame;
and under the condition that the segmentation state information meets the video calibration condition, executing the step of performing object segmentation processing by using the first video frame according to a second segmentation mode to obtain a calibration mask image of the first video frame.
5. The method of claim 4, the segmentation status information satisfying a video calibration condition comprising:
acquiring a calibration time interval corresponding to the first video frame, and determining that the calibration time interval reaches a video calibration time interval threshold; wherein, the calibration time interval refers to the time interval between the acquisition time point of the first video frame and the last video calibration time point;
or,
acquiring the frame number interval between the first video frame and the video frame of the last video calibration, and determining that the frame number interval reaches the threshold of the video calibration frame number interval.
6. The method of claim 4, further comprising:
and in the case that the segmentation state information does not meet the video calibration condition, or in the case that the comparison result of the calibration mask image and the first mask image does not meet the video segmentation calibration condition, outputting the first mask image as a target mask image of the first video frame.
7. The method according to any one of claims 1 to 3, wherein the performing the object segmentation processing by using the historical mask image and the first video frame according to the first segmentation mode to obtain the first mask image of the first video frame comprises:
inputting the historical mask image and the first video frame into a first segmentation model, and outputting a first mask image of the first video frame;
the performing object segmentation processing by using the first video frame according to the second segmentation mode to obtain a calibration mask image of the first video frame includes:
inputting the first video frame into a second segmentation model, and outputting a calibration mask image of the first video frame; or,
setting the historical mask image to zero to obtain a target historical mask image;
inputting the target historical mask image and the first video frame into a third segmentation model, and outputting a calibration mask image of the first video frame.
8. The method of claim 7, in the event that the comparison of the calibration mask image and the first mask image satisfies the video segmentation calibration condition, the method further comprising:
and adjusting the model parameters of the first segmentation model according to the comparison result of the calibration mask image and the first mask image and the data of the first video frame, so as to continue to perform object segmentation processing on the next video frame by using the adjusted first segmentation model.
9. A video object segmentation apparatus, the apparatus comprising:
the acquisition module is used for acquiring a first video frame and a historical mask image of a historical video frame adjacent to the first video frame;
the first segmentation processing module is used for performing object segmentation processing by using the historical mask image and the first video frame according to a first segmentation mode to obtain a first mask image of the first video frame;
the second segmentation processing module is used for carrying out object segmentation processing by using the first video frame according to a second segmentation method to obtain a calibration mask image of the first video frame;
and the target mask image output module is used for outputting the calibration mask image as the target mask image of the first video frame under the condition that the comparison result of the calibration mask image and the first mask image meets the video segmentation calibration condition.
10. A computer device, the computer device comprising:
a memory for storing a program for implementing the video object segmentation method according to any one of claims 1 to 8;
a processor for loading and executing the program stored in the memory to implement the steps of the video object segmentation method according to any one of claims 1 to 8.