WO2021139625A1 - Image processing method, image segmentation model training method and related apparatus - Google Patents

Image processing method, image segmentation model training method and related apparatus

Info

Publication number
WO2021139625A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
training
information
segmentation model
image segmentation
Prior art date
Application number
PCT/CN2021/070167
Other languages
French (fr)
Chinese (zh)
Inventor
叶海佳
何帅
王文斓
Original Assignee
广州虎牙科技有限公司 (Guangzhou Huya Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州虎牙科技有限公司 (Guangzhou Huya Technology Co., Ltd.)
Publication of WO2021139625A1 publication Critical patent/WO2021139625A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/10 - Segmentation; Edge detection
    • G06T7/194 - Segmentation; Edge detection involving foreground-background segmentation
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06T5/00 - Image enhancement or restoration
    • G06T5/50 - Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G06T7/20 - Analysis of motion
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20212 - Image combination
    • G06T2207/20221 - Image fusion; Image merging
    • G06T2207/30 - Subject of image; Context of image processing
    • G06T2207/30204 - Marker

Definitions

  • This application relates to the field of artificial intelligence technology. Specifically, it provides an image processing method, an image segmentation model training method, and related devices.
  • Matting refers to separating the foreground information and the background information in an image and then applying the extracted foreground information to other background information. With matting, the extracted foreground information can be composited with arbitrary background information; in the field of live broadcast, for example, the extracted portrait information can be fused with any background picture or video, thereby enhancing the user's experience of watching the live broadcast.
  • However, current matting technology merely separates the pixels of the portrait information from those of the background information and obtains a mask containing only 0 and 1; when merging, the consistency between adjacent image frames is poor, so the object information in the resulting video picture may jitter.
  • The purpose of this application is to provide an image processing method, an image segmentation model training method, and related devices that can ensure consistency between consecutive images when performing image fusion.
  • An embodiment of the application provides an image segmentation model training method. The method includes:
  • obtaining a training image set and training annotation information corresponding to the training image set, where the training image set includes two training images that are adjacent in time sequence and the optical flow information between the two training images;
  • inputting the two training images into the image segmentation model to obtain two pieces of training mask information; and updating the model parameters of the image segmentation model according to the two pieces of training mask information, the training annotation information, and the optical flow information until the image segmentation model reaches a set convergence condition.
  • An embodiment of the present application also provides an image processing method. The method includes:
  • receiving an image to be processed and a background to be fused; inputting the image to be processed into an image segmentation model trained to convergence using the above training method to obtain target mask information corresponding to the image to be processed; and using the target mask information to process the image to be processed and the background to be fused to obtain a fused image.
  • An embodiment of the present application also provides an image segmentation model training device, the device includes:
  • the first processing module is configured to obtain a training image set and training annotation information corresponding to the training image set; wherein the training image set includes two training images that are adjacent in time sequence, and the optical flow information between the two training images;
  • the first processing module is further configured to input the two training images into the image segmentation model to obtain two training mask information;
  • an update module, configured to update the model parameters of the image segmentation model according to the two training mask information, the training annotation information, and the optical flow information, until the image segmentation model reaches a set convergence condition.
  • An embodiment of the present application also provides an image processing device, which includes:
  • the receiving module is configured to receive the image to be processed and the background to be fused;
  • the second processing module is configured to input the image to be processed into the image segmentation model trained to converge using the image segmentation model training method provided in this application to obtain target mask information corresponding to the image to be processed;
  • the second processing module is further configured to use the target mask information to process the to-be-processed image and the to-be-fused background to obtain a fused image.
  • An embodiment of the present application also provides an electronic device, including:
  • a memory configured to store one or more programs; and
  • a processor; when the one or more programs are executed by the processor, the image segmentation model training method or the image processing method provided in this application is implemented.
  • An embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the image segmentation model training method or the image processing method provided in this application is implemented.
  • FIG. 1 shows a schematic structural block diagram of an electronic device provided by the present application;
  • FIG. 2 shows a schematic flowchart of the image segmentation model training method provided by the present application;
  • FIG. 3 shows a schematic structural diagram of an image segmentation model;
  • FIG. 4 shows a schematic flowchart of the sub-steps of step 206 in FIG. 2;
  • FIG. 5 shows another schematic flowchart of the image segmentation model training method provided by the present application;
  • FIG. 6 shows a schematic diagram of a method of extracting optical flow information;
  • FIG. 7 shows still another schematic flowchart of the image segmentation model training method provided by the present application;
  • FIG. 8 shows a schematic flowchart of the image processing method provided by the present application;
  • FIG. 9 shows a schematic comparison of an image before and after fusion;
  • FIG. 10 shows a schematic structural block diagram of the image segmentation model training device provided by the present application;
  • FIG. 11 shows a schematic structural block diagram of the image processing device provided by the present application.
  • In the figures: 100 - electronic device; 101 - memory; 102 - processor; 103 - communication interface; 400 - image segmentation model training device; 401 - first processing module; 402 - update module; 500 - image processing device; 501 - receiving module; 502 - second processing module.
  • In a live broadcast scene, for example, matting can be used to separate foreground information such as the host's portrait from the background information, and the separated foreground information can then be merged with other background information. Denoting the foreground information separated by matting as F and the fused background information as B, the fused image I can be expressed as: I = mF + (1 - m)B,
  • where m represents the mask information (mask) corresponding to the foreground information F.
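  • To make the role of the soft mask concrete, the following is a minimal Python sketch of the fusion expression I = mF + (1 - m)B; the helper name `composite` and the array shapes are illustrative assumptions, not part of the patent.

```python
import numpy as np

def composite(foreground: np.ndarray, background: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Fuse a foreground onto a background with a soft mask: I = m*F + (1 - m)*B.

    foreground, background: HxWx3 float arrays in [0, 1].
    mask: HxW float array in [0, 1]; 1 keeps the foreground, 0 keeps the background.
    """
    m = mask[..., None]  # broadcast the HxW mask over the color channels
    return m * foreground + (1.0 - m) * background

# Toy usage: a bright foreground composited onto a dark background with a soft mask.
F = np.full((4, 4, 3), 0.9)
B = np.zeros((4, 4, 3))
m = np.zeros((4, 4))
m[1:3, 1:3] = 0.8  # soft values strictly between 0 and 1, unlike a binary mask
I = composite(F, B, m)
```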
  • In some matting schemes, such as portrait binary semantic segmentation, the mask information is obtained by understanding the image at the semantic level and classifying the information in the image into foreground pixels and background pixels by semantic category; the result is a mask with a value range between 0 and 1.
  • Some possible implementations provided by this application are as follows: obtain two training images that are adjacent in time sequence, together with the optical flow information between them, as a training image set, and obtain the training annotation information corresponding to the training image set; input the two training images into the image segmentation model to obtain two pieces of training mask information; then update the model parameters of the image segmentation model according to the two pieces of training mask information, the training annotation information, and the optical flow information until the image segmentation model reaches the set convergence condition. This enables the image segmentation model to combine the motion information between each image and its adjacent images when extracting the mask information of the corresponding image, which ensures consistency between consecutive images when performing image fusion.
  • FIG. 1 shows a schematic structural block diagram of an electronic device 100 provided in this application.
  • The electronic device 100 can store an untrained image segmentation model and execute the image segmentation model training method provided in this application to complete the training of the image segmentation model; alternatively, the electronic device 100 may store an image segmentation model trained to convergence using the image segmentation model training method provided in this application, and use that model to implement the image processing method provided in this application.
  • the electronic device 100 may include a memory 101, a processor 102, and a communication interface 103.
  • The memory 101, the processor 102, and the communication interface 103 are directly or indirectly electrically connected to one another to realize data transmission or interaction.
  • For example, these components can be electrically connected to each other through one or more communication buses or signal lines.
  • The memory 101 may be configured to store software programs and modules, such as the program instructions/modules corresponding to the image segmentation model training device or the image processing device provided in this application. The processor 102 executes the software programs and modules stored in the memory 101, thereby performing various functional applications and data processing, and in turn carrying out the steps of the image segmentation model training method or the image processing method provided in this application.
  • the communication interface 103 may be configured to perform signaling or data communication with other node devices.
  • The memory 101 may be, but is not limited to, random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and so on.
  • the processor 102 may be an integrated circuit chip with signal processing capabilities.
  • The processor 102 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • It can be understood that FIG. 1 is only for illustration; the electronic device 100 may include more or fewer components than those shown in FIG. 1, or have a configuration different from that shown in FIG. 1.
  • the components shown in FIG. 1 can be implemented by hardware, software, or a combination thereof.
  • Hereinafter, the electronic device 100 shown in FIG. 1 is used as an exemplary execution subject to describe the image segmentation model training method provided in the present application.
  • FIG. 2 shows a schematic flowchart of the image segmentation model training method provided by the present application.
  • the image segmentation model training method may include the following steps:
  • Step 202: Obtain a training image set and training annotation information corresponding to the training image set.
  • Step 204: Input the two training images into the image segmentation model to obtain two pieces of training mask information.
  • Step 206: Update the model parameters of the image segmentation model according to the two pieces of training mask information, the training annotation information, and the optical flow information, until the image segmentation model reaches the set convergence condition.
  • an image segmentation model as shown in FIG. 3 may be stored in the electronic device.
  • The image segmentation model can process an input image and output the mask information of the corresponding image; the network structure adopted by the image segmentation model may be a UNet network or a segmentation network such as DeepLabv3 or SegNet, and this application does not limit the network structure of the image segmentation model.
  • In the process of training the image segmentation model, the electronic device can first obtain the training image set and the training annotation information corresponding to the training image set, where the training image set includes two training images that are adjacent in time sequence, such as I0 and I1 in FIG. 3, together with the optical flow information between them.
  • The optical flow information represents the motion cues between the two training images, that is, the correlation between I0 and I1.
  • Then, the electronic device can input the two training images I0 and I1 into the image segmentation model to obtain two pieces of training mask information; for example, in the scene shown in FIG. 3, the mask information corresponding to I0 may be Mask0 and the mask information corresponding to I1 may be Mask1.
  • Finally, the electronic device can update the model parameters of the image segmentation model according to the two pieces of training mask information Mask0 and Mask1, the training annotation information, and the optical flow information, using, for example, the backpropagation (BP) algorithm, until the image segmentation model reaches the set convergence condition. Since the optical flow information represents the motion information between the two training images, the mask information corresponding to the two training images also carries that motion; in this way, in the process of updating the model parameters, the image segmentation model can use the optical flow information to learn the motion information between the two training images, so that when extracting the mask information of a target image it can draw on the mask information of the images adjacent to the target image, thereby maintaining consistency between adjacent images.
  • In the image segmentation model training method provided by this application, two training images adjacent in time sequence and the optical flow information between the two training images are obtained as a training image set, together with the training annotation information corresponding to the training image set; the two training images are then input into the image segmentation model to obtain two pieces of training mask information; and the model parameters of the image segmentation model are updated according to the two pieces of training mask information, the training annotation information, and the optical flow information until the image segmentation model reaches the set convergence condition. In this way, in the process of training the image segmentation model, the model can use the optical flow information to learn the motion information between images, so that it can combine the motion information between each image and its adjacent images to extract the mask information of the corresponding image, ensuring consistency between consecutive images when performing image fusion.
  • The two time-adjacent training images obtained by the electronic device generally include a first image that is earlier in the time sequence and a second image that is later; for example, the earlier image I0 in FIG. 3 can be used as the first image and the later image I1 as the second image.
  • The two pieces of training mask information output by the image segmentation model include the first training mask information corresponding to the first image and the second training mask information corresponding to the second image; for example, Mask0 corresponding to I0 in FIG. 3 can be used as the first training mask information, and Mask1 corresponding to I1 as the second training mask information.
  • In some embodiments, optical flow is the apparent motion of a target between two consecutive frames, caused by the movement of the target, the scene, or the camera; optical flow information is vector-valued, and optical flow is generally divided into forward optical flow and backward optical flow.
  • For example, when image I0 precedes image I1 in the time sequence, the optical flow information from image I0 to image I1 is the backward optical flow, and it records the direction and rate of motion from image I0 to image I1; the optical flow information from image I1 to image I0 is the forward optical flow, and it records the direction and rate of motion from image I1 to image I0.
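  • As an illustration of how a flow field relates the two frames, the sketch below warps an image (or its mask) with a dense flow field using backward sampling; the helper name `warp_with_flow` and the per-pixel (dx, dy) convention are assumptions for illustration only.

```python
import numpy as np
import cv2

def warp_with_flow(image: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Warp `image` with a dense flow field of shape HxWx2 holding per-pixel (dx, dy).

    Each output pixel (x, y) is sampled from `image` at (x + dx, y + dy),
    the usual backward-warping convention.
    """
    h, w = flow.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    return cv2.remap(image, map_x, map_y, interpolation=cv2.INTER_LINEAR)
```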
  • The foregoing is only for illustration, with the image earlier in the time sequence used as the first image and the image later in the time sequence used as the second image; in some other possible embodiments of the present application, the later image may instead be regarded as the first image and the earlier image as the second image. This application does not limit this.
  • In some embodiments, the training annotation information obtained by the electronic device may be the annotation mask information of the second image. Continuing the above example, the training annotation information obtained by the electronic device may be the annotation mask information of image I1; accordingly, the optical flow information obtained by the electronic device may be the backward optical flow with respect to image I1.
  • FIG. 4 shows a schematic flowchart of the sub-steps of step 206 in FIG. 2.
  • step 206 may include the following sub-steps:
  • Step 206-1: Obtain the content loss of the image segmentation model according to the annotation mask information, the second training mask information, and the second image.
  • Step 206-2: Obtain the timing loss of the image segmentation model according to the optical flow information, the first image, the second image, the first training mask information, and the second training mask information.
  • Step 206-3: Update the model parameters of the image segmentation model based on the content loss and the timing loss.
  • the electronic device may divide the loss function of the image segmentation model into two parts: content loss and timing loss.
  • In some embodiments, the loss function of the image segmentation model may satisfy the following formula: L = Lc + Lst, where L represents the total loss of the image segmentation model, Lc represents the content loss, and Lst represents the timing loss.
  • The content loss constrains the second training mask information output by the image segmentation model against the actual mask information of the second image, and thereby guarantees the accuracy of the segmentation result.
  • In some embodiments, the electronic device can obtain the content loss of the image segmentation model according to the annotation mask information, the second training mask information, and the second image, that is, by calculating the difference between the second training mask information and the annotation mask information.
  • The calculation formula for the content loss may satisfy the following, where Lc represents the content loss, mask_gt represents the annotation mask information, mask1_pre represents the second training mask information, and I1 represents the second image.
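  • The patent renders the content-loss formula as an image that is not reproduced in this text, so the sketch below is only a hedged stand-in: it assumes a simple L1 penalty between the predicted second mask and the annotation mask, and it does not reconstruct the exact role of the second image I1 in the original formula.

```python
import torch

def content_loss(mask1_pre: torch.Tensor, mask_gt: torch.Tensor) -> torch.Tensor:
    """Assumed content loss: an L1 penalty pulling the second training mask
    toward the annotated mask, which constrains segmentation accuracy."""
    return torch.mean(torch.abs(mask1_pre - mask_gt))
```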
  • The timing loss constrains the motion information between the two frames, ensuring that the mask information corresponding to the two frames remains consistent in time sequence.
  • the electronic device obtains the timing loss of the image segmentation model according to the optical flow information, the first image, the second image, the first training mask information, and the second training mask information.
  • The timing loss calculation formula may satisfy the following, where Lst represents the timing loss, a set parameter serves as a weighting term, I0 represents the first image, warp01 represents the optical flow information, and mask0_pre represents the first training mask information.
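  • The timing-loss formula is likewise not reproduced here, so the following is a hedged sketch of a common temporal-consistency form: the first mask is warped toward the second with the optical flow, and the remaining difference is penalized under a set weight. The helper names and the normalized-coordinate flow convention are assumptions.

```python
import torch
import torch.nn.functional as F

def warp_mask(mask0_pre: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp a Bx1xHxW mask with a BxHxWx2 flow field given in normalized
    [-1, 1] coordinates, using bilinear sampling."""
    b, _, h, w = mask0_pre.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    return F.grid_sample(mask0_pre, base + flow, align_corners=True)

def timing_loss(mask0_pre, mask1_pre, flow, weight: float = 1.0) -> torch.Tensor:
    """Assumed timing loss: weight * |mask1_pre - warp(mask0_pre)|, which keeps
    the two masks consistent in time sequence."""
    warped = warp_mask(mask0_pre, flow)
    return weight * torch.mean(torch.abs(mask1_pre - warped))
```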
  • In some embodiments, the electronic device can compute the sum of the content loss and the timing loss as the total loss of the image segmentation model, and then update the model parameters of the image segmentation model with that total loss based on, for example, the BP algorithm, iterating the training until the image segmentation model reaches the set convergence condition.
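  • Putting the pieces together, a minimal training step under the same assumptions as the two loss sketches above might look as follows; `model` and `loader` are hypothetical placeholders for the segmentation network and a data source yielding one training image set per batch.

```python
import torch

def train(model, loader, epochs: int = 10, lr: float = 1e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for i0, i1, flow, mask_gt in loader:
            mask0_pre = model(i0)   # first training mask information
            mask1_pre = model(i1)   # second training mask information
            # Total loss is the sum of the content loss and the timing loss.
            loss = content_loss(mask1_pre, mask_gt) + timing_loss(mask0_pre, mask1_pre, flow)
            opt.zero_grad()
            loss.backward()         # backpropagation (BP) update
            opt.step()
```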
  • In addition, the electronic device may first perform step 206-1 to obtain the content loss and then perform step 206-2 to obtain the timing loss, or it may first perform step 206-2 to obtain the timing loss and then perform step 206-1 to obtain the content loss; this application does not limit the order of execution of step 206-1 and step 206-2. For example, in another possible implementation, step 206-1 and step 206-2 may also be executed together.
  • Optical flow is a two-dimensional vector field arising from image translation: it uses a two-dimensional image to represent the velocity field of object points in three-dimensional motion, reflecting the image change produced by motion within a small time interval and determining the direction and rate of motion at each image point, so that optical flow can be configured to provide clues for recovering image motion.
  • In some embodiments, the electronic device can obtain the optical flow information between the two training images through online extraction, thereby reducing the user's workload in the process of training the image segmentation model.
  • Referring to FIG. 5, which shows another schematic flowchart of the image segmentation model training method provided by the present application, before step 202 is performed the image segmentation model training method may further include the following step:
  • Step 201: Extract the inter-frame optical flow between the two training images to obtain the optical flow information.
  • In some embodiments, the electronic device may use, for example, the SelFlow algorithm to extract the inter-frame optical flow between the two training images to obtain the optical flow information.
  • For example, as shown in FIG. 6, the electronic device can take image I0 and image I1 as input and use the SelFlow algorithm to extract the backward inter-frame optical flow of image I1, thereby obtaining the optical flow information.
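  • SelFlow is a research model without a standard packaged API, so the sketch below substitutes OpenCV's Farneback estimator as a stand-in dense-flow extractor; only the overall role (two frames in, one flow field out) matches the patent's description.

```python
import cv2

def extract_flow(frame0_bgr, frame1_bgr):
    """Estimate a dense HxWx2 flow field (per-pixel dx, dy) between two frames."""
    g0 = cv2.cvtColor(frame0_bgr, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(frame1_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.calcOpticalFlowFarneback(g0, g1, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
```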
  • However, with online extraction, the training time of the image segmentation model is lengthened because the step of extracting optical flow information has to be performed during training, and the operation of extracting optical flow information may be repeated on the same pair of training images.
  • Therefore, in some other embodiments, the optical flow information can also be obtained by offline extraction; that is, step 201 is performed first, and after the optical flow information of each pair of training images is obtained and stored, it can simply be read when needed. Obtaining the optical flow information offline in step 201 can reduce the training time of the image segmentation model and avoid repeatedly extracting optical flow information from the same pair of training images.
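  • A hedged sketch of such offline extraction, reusing the `extract_flow` stand-in above: the flow for each training pair is computed once, cached to disk, and read back during training, so no pair is processed twice. The cache layout is an assumption.

```python
import os
import numpy as np

def precompute_flows(pairs, cache_dir="flow_cache"):
    """Extract and cache the flow for every (frame0, frame1) training pair once."""
    os.makedirs(cache_dir, exist_ok=True)
    for idx, (frame0, frame1) in enumerate(pairs):
        path = os.path.join(cache_dir, f"pair_{idx:06d}.npy")
        if not os.path.exists(path):  # skip pairs whose flow was already extracted
            np.save(path, extract_flow(frame0, frame1))

def load_flow(idx, cache_dir="flow_cache"):
    return np.load(os.path.join(cache_dir, f"pair_{idx:06d}.npy"))
```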
  • Referring to FIG. 7, which shows still another schematic flowchart of the image segmentation model training method provided by the present application, before step 202 is performed the image segmentation model training method may further include the following step:
  • Step 200: Fuse each of the two pieces of obtained object information with one piece of background information to generate the two training images.
  • For example, in a live broadcast scene, the user can extract the object information from two images that are adjacent in time sequence and transmit the two pieces of object information to the electronic device; the electronic device can then fuse each of the two pieces of object information with the same piece of background information to generate two training images, that is, one training image set, thereby increasing the amount of training data.
  • In some embodiments, the same two pieces of object information can be fused with different background information to generate multiple training image sets, each training image set including two training images.
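  • A sketch of this data synthesis, reusing the `composite` helper from the fusion example above; the function names are hypothetical, and the foreground/mask inputs stand for the two pieces of time-adjacent object information described in step 200.

```python
def make_training_pair(fg0, fg1, mask0, mask1, background):
    """Composite two time-adjacent foregrounds onto one shared background,
    yielding the two training images of a single training image set."""
    i0 = composite(fg0, background, mask0)  # first training image
    i1 = composite(fg1, background, mask1)  # second training image
    return i0, i1

def make_training_sets(fg0, fg1, mask0, mask1, backgrounds):
    """Reuse the same object information with different backgrounds to
    generate multiple training image sets."""
    return [make_training_pair(fg0, fg1, mask0, mask1, bg) for bg in backgrounds]
```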
  • As noted above, in the fusion expression I = mF + (1 - m)B, m represents the mask information corresponding to the foreground information F; it can be seen that, based on this expression, as long as the mask information m corresponding to the foreground information F can be obtained, the foreground information F can be fused with any background information B.
  • In a live scene, for example, the trained image segmentation model is configured to provide the mask information used to fuse the foreground information F with the chosen background.
  • FIG. 8 shows a schematic flowchart of the image processing method provided by the present application.
  • the image processing method may include the following steps:
  • Step 302: Receive the image to be processed and the background to be fused.
  • Step 304: Input the image to be processed into the image segmentation model trained to convergence using the image segmentation model training method, to obtain target mask information corresponding to the image to be processed.
  • Step 306: Use the target mask information to process the image to be processed and the background to be fused to obtain a fused image.
  • Taking a live broadcast scene as an example, the electronic device may take each frame of the received live video as an image to be processed, and receive a background to be fused.
  • The purpose is to replace the background information of each frame of the live video image with the background to be fused.
  • In this way, the electronic device can input the image to be processed into the image segmentation model trained to convergence using the image segmentation model training method provided in this application, and the image segmentation model outputs the target mask information Mask_m corresponding to the image to be processed.
  • Then, the electronic device can use the obtained target mask information Mask_m as the parameter m in the above fusion formula, and substitute the image to be processed and the background to be fused into the fusion formula to obtain the fused image I.
  • The effect before and after fusion can be as shown in FIG. 9. In this way, after the image segmentation model has used optical flow information to learn the motion information between images, it can, in the process of image fusion, combine the motion information between each image and its adjacent images to extract the mask information of the corresponding image, which ensures consistency between consecutive images when performing image fusion.
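  • An end-to-end sketch of steps 302 through 306 under the same assumptions as above; the tensor layout and the `process_frame` name are illustrative, and the fusion step is the formula I = mF + (1 - m)B with the predicted mask as m.

```python
import torch

@torch.no_grad()
def process_frame(model, frame: torch.Tensor, new_background: torch.Tensor) -> torch.Tensor:
    """frame, new_background: 1x3xHxW float tensors in [0, 1]."""
    mask_m = model(frame)  # target mask information, 1x1xHxW in [0, 1]
    return mask_m * frame + (1.0 - mask_m) * new_background
```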
  • FIG. 10 shows a schematic structural block diagram of the image segmentation model training device 400 provided by this application.
  • In some embodiments, the image segmentation model training device 400 may include a first processing module 401 and an update module 402, where:
  • the first processing module 401 may be configured to obtain a training image set and training annotation information corresponding to the training image set, where the training image set includes two training images that are adjacent in time sequence and the optical flow information between the two training images;
  • the first processing module 401 may also be configured to input two training images into the image segmentation model to obtain two training mask information;
  • the update module 402 may be configured to update the model parameters of the image segmentation model according to the two training mask information, the training annotation information, and the optical flow information, until the image segmentation model reaches the set convergence condition.
  • the two training images include a first image that is earlier in time sequence and a second image that is later in time sequence, and the training annotation information is the annotation mask information of the second image;
  • the two training mask information include first training mask information corresponding to the first image and second training mask information corresponding to the second image;
  • In some embodiments, the update module 402 can be configured to: obtain the content loss of the image segmentation model according to the annotation mask information, the second training mask information, and the second image; obtain the timing loss of the image segmentation model according to the optical flow information, the first image, the second image, the first training mask information, and the second training mask information; and update the model parameters of the image segmentation model based on the content loss and the timing loss.
  • In some embodiments, in the process of updating the model parameters of the image segmentation model based on the content loss and the timing loss, the update module 402 can be configured to:
  • calculate the sum of the content loss and the timing loss as the total loss of the image segmentation model, so as to update the model parameters of the image segmentation model with the total loss.
  • The calculation formula for the content loss satisfies the following, where Lc represents the content loss, mask_gt represents the annotation mask information, mask1_pre represents the second training mask information, and I1 represents the second image.
  • The timing loss calculation formula satisfies the following, where Lst represents the timing loss, a set parameter serves as a weighting term, I0 represents the first image, warp01 represents the optical flow information, and mask0_pre represents the first training mask information.
  • In some embodiments, the annotation mask information is the mask information annotated for the captured live picture.
  • In some embodiments, the first processing module 401 may also be configured to extract the inter-frame optical flow between the two training images to obtain the optical flow information.
  • In some embodiments, the first processing module 401 may be configured to use the SelFlow algorithm to extract the inter-frame optical flow between the two training images to obtain the optical flow information.
  • In some embodiments, the first processing module 401 may also be configured to fuse each of the two pieces of obtained object information with one piece of background information to generate the two training images.
  • FIG. 11 shows a schematic structural block diagram of the image processing apparatus 500 provided by the present application.
  • The image processing apparatus 500 may include a receiving module 501 and a second processing module 502, where:
  • the receiving module 501 may be configured to receive the image to be processed and the background to be merged;
  • the second processing module 502 may be configured to input the image to be processed into an image segmentation model trained to convergence using the above-mentioned image segmentation model training method provided in this application, to obtain target mask information corresponding to the image to be processed;
  • the second processing module 502 may also be configured to use the target mask information to process the image to be processed and the background to be fused to obtain a fused image.
  • In some embodiments, the receiving module 501 may be configured to take each frame of a received live video as the image to be processed and to receive the background to be fused.
  • Each block in the flowchart or block diagram may represent a module, program segment, or part of code, and that module, program segment, or part of code contains one or more executable instructions configured to implement the prescribed logical function.
  • It should also be noted that the functions marked in the blocks may occur in an order different from the order marked in the drawings.
  • For example, two consecutive blocks can actually be executed substantially in parallel, or they can sometimes be executed in the reverse order, depending on the functions involved.
  • Each block in the block diagram and/or flowchart, and each combination of blocks in the block diagram and/or flowchart, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
  • the functional modules in some embodiments of the present application may be integrated together to form an independent part, or each module may exist alone, or two or more modules may be integrated to form an independent part.
  • If the function is implemented in the form of a software function module and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the existing technology, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in some embodiments of the present application.
  • The aforementioned storage media include media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory, a random access memory, a magnetic disk, or an optical disc.
  • In summary, in the process of training the image segmentation model, two training images adjacent in time sequence and the optical flow information between the two training images are obtained as the training image set, together with the corresponding training annotation information; the two training images are input into the image segmentation model to obtain two pieces of training mask information; and the model parameters of the image segmentation model are updated according to the two pieces of training mask information, the training annotation information, and the optical flow information until the image segmentation model reaches the set convergence condition. In this way, the image segmentation model can use the optical flow information to learn the motion information between images, combine the motion information between each image and its adjacent images to extract the mask information of the corresponding image, and thereby ensure consistency between consecutive images when performing image fusion.

Abstract

Embodiments of the present application relate to the technical field of artificial intelligence and provide an image processing method, an image segmentation model training method, and a related apparatus. The method comprises: obtaining two training images that are adjacent in time sequence and the optical flow information between the two training images as a training image set, and obtaining training annotation information corresponding to the training image set; inputting the two training images into the image segmentation model to obtain two pieces of training mask information; and then, according to the two pieces of training mask information, the training annotation information, and the optical flow information, updating the model parameters of the image segmentation model until the image segmentation model reaches the set convergence condition. As such, the image segmentation model can use the optical flow information to learn motion information between the images, so that the model can combine the motion information between each image and other adjacent images, extract the mask information of the corresponding image, and ensure consistency between consecutive images when performing image fusion.

Description

Image processing method, image segmentation model training method and related apparatus
Cross-reference to related applications
This application claims priority to the Chinese patent application No. 2020100143725, titled "Image Processing Method, Image Segmentation Model Training Method and Related Apparatus", filed with the Chinese Patent Office on January 7, 2020, the entire contents of which are incorporated into this application by reference.
Technical field
This application relates to the field of artificial intelligence technology, and specifically provides an image processing method, an image segmentation model training method, and a related apparatus.
Background
Matting refers to separating the foreground information and the background information in an image and then applying the extracted foreground information to other background information. With matting, the extracted foreground information can be composited with arbitrary background information; for example, in the field of live broadcast, the extracted portrait information can be fused with any background picture or video, thereby enhancing the user's experience of watching the live broadcast.
However, current matting technology merely separates the pixels of the portrait information from those of the background information and obtains a mask containing only 0 and 1; when merging, the consistency between adjacent image frames is poor, so the object information in the resulting video picture may jitter.
Summary of the invention
The purpose of this application is to provide an image processing method, an image segmentation model training method, and a related apparatus that can ensure consistency between consecutive images when performing image fusion.
In order to achieve at least one of the above objectives, the technical solutions adopted in this application are as follows:
An embodiment of the application provides an image segmentation model training method. The method includes:
obtaining a training image set and training annotation information corresponding to the training image set, where the training image set includes two training images that are adjacent in time sequence and the optical flow information between the two training images;
inputting the two training images into the image segmentation model to obtain two pieces of training mask information; and
updating the model parameters of the image segmentation model according to the two pieces of training mask information, the training annotation information, and the optical flow information until the image segmentation model reaches a set convergence condition.
An embodiment of the present application also provides an image processing method. The method includes:
receiving an image to be processed and a background to be fused;
inputting the image to be processed into an image segmentation model trained to convergence using the above image segmentation model training method provided in this application to obtain target mask information corresponding to the image to be processed; and
using the target mask information to process the image to be processed and the background to be fused to obtain a fused image.
An embodiment of the present application also provides an image segmentation model training device. The device includes:
a first processing module configured to obtain a training image set and training annotation information corresponding to the training image set, where the training image set includes two training images that are adjacent in time sequence and the optical flow information between the two training images;
the first processing module being further configured to input the two training images into the image segmentation model to obtain two pieces of training mask information; and
an update module configured to update the model parameters of the image segmentation model according to the two pieces of training mask information, the training annotation information, and the optical flow information until the image segmentation model reaches a set convergence condition.
An embodiment of the present application also provides an image processing device. The device includes:
a receiving module configured to receive an image to be processed and a background to be fused;
a second processing module configured to input the image to be processed into an image segmentation model trained to convergence using the above image segmentation model training method provided in this application to obtain target mask information corresponding to the image to be processed;
the second processing module being further configured to use the target mask information to process the image to be processed and the background to be fused to obtain a fused image.
An embodiment of the present application also provides an electronic device, including:
a memory configured to store one or more programs; and
a processor;
when the one or more programs are executed by the processor, the above image segmentation model training method or image processing method provided in this application is implemented.
An embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the above image segmentation model training method or image processing method provided in this application is implemented.
Brief description of the drawings
In order to explain the technical solutions of the present application more clearly, the drawings needed in the embodiments are briefly introduced below. It should be understood that the following drawings show only certain embodiments of the present application and therefore should not be regarded as limiting the scope; for those of ordinary skill in the art, other related drawings can be obtained from these drawings without creative work.
FIG. 1 shows a schematic structural block diagram of an electronic device provided by the present application;
FIG. 2 shows a schematic flowchart of the image segmentation model training method provided by the present application;
FIG. 3 shows a schematic structural diagram of an image segmentation model;
FIG. 4 shows a schematic flowchart of the sub-steps of step 206 in FIG. 2;
FIG. 5 shows another schematic flowchart of the image segmentation model training method provided by the present application;
FIG. 6 shows a schematic diagram of a method of extracting optical flow information;
FIG. 7 shows still another schematic flowchart of the image segmentation model training method provided by the present application;
FIG. 8 shows a schematic flowchart of the image processing method provided by the present application;
FIG. 9 shows a schematic comparison of an image before and after fusion;
FIG. 10 shows a schematic structural block diagram of the image segmentation model training device provided by the present application;
FIG. 11 shows a schematic structural block diagram of the image processing device provided by the present application.
In the figures: 100 - electronic device; 101 - memory; 102 - processor; 103 - communication interface; 400 - image segmentation model training device; 401 - first processing module; 402 - update module; 500 - image processing device; 501 - receiving module; 502 - second processing module.
Detailed description of embodiments
To make the purpose, technical solutions, and advantages of this application clearer, the technical solutions in this application are described clearly and completely below in conjunction with the drawings in some embodiments of this application. Obviously, the described embodiments are some, but not all, of the embodiments of this application. The components of the present application generally described and shown in the drawings herein may be arranged and designed in a variety of different configurations.
Therefore, the following detailed description of the embodiments of the present application provided in the accompanying drawings is not intended to limit the scope of the claimed application but merely represents selected embodiments of the present application. Based on some of the embodiments in this application, all other embodiments obtained by a person of ordinary skill in the art without creative work fall within the protection scope of this application.
It should be noted that similar reference numerals and letters indicate similar items in the following figures; therefore, once an item is defined in one figure, it does not need to be further defined or explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are only used to distinguish descriptions and cannot be understood as indicating or implying relative importance.
It should also be noted that, in this document, relational terms such as first and second are only used to distinguish one entity or operation from another entity or operation and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to the process, method, article, or device. Without further restrictions, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.
In the live broadcast field mentioned above, for example, matting can be used to separate foreground information such as the host's portrait from the background information, and the separated foreground information can then be merged with other background information.
Assuming that the foreground information separated by matting is denoted as F and the fused background information as B, the fused image I can be expressed as: I = mF + (1 - m)B.
In the formula, m represents the mask information (mask) corresponding to the foreground information F.
From the above fusion formula for image I, since the foreground information F and the background information B are both fixed inputs, the fusion effect of image I is mainly affected by the value of the mask information.
In some matting schemes, such as portrait binary semantic segmentation, the mask information is obtained by understanding the image at the semantic level and classifying the information in the image into foreground pixels and background pixels by semantic category; the result is a mask with a value range between 0 and 1.
However, in scenes such as webcasting, in the process of fusing foreground information such as the host's portrait with other background information, because the webcast plays a continuous video stream, it is necessary to consider not only the fusion between the foreground information and the background information but also that the segmentation results of two consecutive frames must not deviate too much from each other.
The portrait binary semantic segmentation scheme described above, however, considers only the segmentation of a single frame and does not take into account the temporal consistency of the segmentation results of adjacent frames; when it is applied to portrait segmentation in scenes such as webcasts, after the segmented foreground information is fused with other background information, the object information in the resulting video picture may jitter, affecting the user experience.
Therefore, in order to solve at least some of the shortcomings of the above related solutions, some possible implementations provided by this application are as follows: obtain two training images that are adjacent in time sequence, together with the optical flow information between them, as a training image set, and obtain the training annotation information corresponding to the training image set; input the two training images into the image segmentation model to obtain two pieces of training mask information; then update the model parameters of the image segmentation model according to the two pieces of training mask information, the training annotation information, and the optical flow information until the image segmentation model reaches the set convergence condition. In this way, the image segmentation model can combine the motion information between each image and its adjacent images to extract the mask information of the corresponding image, which ensures consistency between consecutive images when performing image fusion.
Some embodiments of the present application are described in detail below with reference to the accompanying drawings. Where there is no conflict, the following embodiments and the features in the embodiments can be combined with each other.
Referring to FIG. 1, which shows a schematic structural block diagram of an electronic device 100 provided in this application, the electronic device 100 can store an untrained image segmentation model and execute the image segmentation model training method provided in this application to complete the training of the image segmentation model; alternatively, the electronic device 100 may store an image segmentation model trained to convergence using the image segmentation model training method provided in this application and use that model to implement the image processing method provided in this application.
In some embodiments, the electronic device 100 may include a memory 101, a processor 102, and a communication interface 103. The memory 101, the processor 102, and the communication interface 103 are directly or indirectly electrically connected to one another to realize data transmission or interaction. For example, these components can be electrically connected to each other through one or more communication buses or signal lines.
The memory 101 may be configured to store software programs and modules, such as the program instructions/modules corresponding to the image segmentation model training device or the image processing device provided in this application; the processor 102 executes the software programs and modules stored in the memory 101, thereby performing various functional applications and data processing and, in turn, the steps of the image segmentation model training method or the image processing method provided in this application. The communication interface 103 may be configured to perform signaling or data communication with other node devices.
The memory 101 may be, but is not limited to, random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and so on.
The processor 102 may be an integrated circuit chip with signal processing capabilities. The processor 102 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
It can be understood that the structure shown in FIG. 1 is only for illustration; the electronic device 100 may include more or fewer components than those shown in FIG. 1 or have a configuration different from that shown in FIG. 1. The components shown in FIG. 1 can be implemented in hardware, software, or a combination thereof.
Hereinafter, the electronic device 100 shown in FIG. 1 is used as an exemplary execution subject to describe the image segmentation model training method provided in the present application.
Please refer to FIG. 2, which shows a schematic flowchart of the image segmentation model training method provided in this application. In some embodiments, the image segmentation model training method may include the following steps:
Step 202: obtain a training image set and the training annotation information corresponding to the training image set.
Step 204: input the two training images into the image segmentation model to obtain two pieces of training mask information.
Step 206: update the model parameters of the image segmentation model according to the two pieces of training mask information, the training annotation information, and the optical flow information, until the image segmentation model reaches a set convergence condition.
In some embodiments, the electronic device may store an image segmentation model as shown in FIG. 3. The image segmentation model can process an input image and output the mask information of that image. The network structure adopted by the image segmentation model may be a Unet network, or a segmentation network such as Deeplabv3 or SEGNET; this application does not limit the network structure of the image segmentation model.
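By way of illustration only (the application does not prescribe any particular implementation), such a segmentation network can be instantiated in a few lines of PyTorch, here using torchvision's DeepLabv3 as a stand-in; the single-channel output head is an assumption made for mask prediction:

    import torch
    from torchvision.models.segmentation import deeplabv3_resnet50

    # DeepLabv3 with a 1-channel head, standing in for the image segmentation model.
    model = deeplabv3_resnet50(weights=None, num_classes=1)
    logits = model(torch.rand(1, 3, 224, 224))["out"]  # (1, 1, 224, 224) mask logits
    mask = torch.sigmoid(logits)                       # mask information in [0, 1]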
In the process of training the image segmentation model, the electronic device may first obtain a training image set and the training annotation information corresponding to it. The training image set includes two training images that are adjacent in time sequence, such as I0 and I1 in FIG. 3, together with the optical flow information between the two training images; the optical flow information characterizes the motion cues between the two training images, that is, the correlation between I0 and I1.
Then, as shown in FIG. 3, the electronic device may input the two training images I0 and I1 into the image segmentation model to obtain two pieces of training mask information; for example, in the scene shown in FIG. 3, the mask information corresponding to I0 may be Mask0, and the mask information corresponding to I1 may be Mask1.
Finally, the electronic device may update the model parameters of the image segmentation model according to the two pieces of training mask information (for example, Mask0 and Mask1), the training annotation information, and the optical flow information, using, for example, the backpropagation (BP) algorithm, until the image segmentation model reaches the set convergence condition. Since the optical flow information characterizes the motion information between the two training images, the mask information corresponding to each of the two training images also carries that motion information. In this way, while the model parameters are being updated, the image segmentation model can use the optical flow information to learn the motion information between the two training images, so that when extracting the mask information of a target image, the model can draw on the mask information of the images adjacent to the target image, thereby maintaining consistency between adjacent images.
It can be seen that, based on the above design, the image segmentation model training method provided in this application obtains two training images that are adjacent in time sequence and the optical flow information between them as a training image set, together with the training annotation information corresponding to the training image set; it then inputs the two training images into the image segmentation model to obtain two pieces of training mask information; and it further updates the model parameters of the image segmentation model according to the two pieces of training mask information, the training annotation information, and the optical flow information, until the image segmentation model reaches the set convergence condition. In this way, during training, the image segmentation model can use the optical flow information to learn the motion information between images, so that it can combine the motion information between each image and its adjacent images when extracting the mask information of the corresponding image, thereby ensuring consistency between consecutive images during image fusion.
It should be noted that the two training images adjacent in time sequence obtained by the electronic device generally include a first image that is earlier in the sequence and a second image that is later in the sequence; for example, I0 in FIG. 3, which is earlier in the sequence, may serve as the first image, and I1, which is later, may serve as the second image.
Correspondingly, the two pieces of training mask information output by the image segmentation model include the first training mask information corresponding to the first image and the second training mask information corresponding to the second image; for example, Mask0 corresponding to I0 in FIG. 3 may serve as the first training mask information, and Mask1 corresponding to I1 may serve as the second training mask information.
In addition, it should be noted that optical flow is the motion of a target caused by the movement of the target, the scene, or the camera between two consecutive frames; optical flow information is vector information, and optical flow is generally divided into forward optical flow and backward optical flow. For example, between the two frames shown in FIG. 3, image I0 precedes image I1 in the time sequence; for image I1, the optical flow from image I0 to image I1 is the backward optical flow, which records the motion direction and rate from image I0 to image I1, while the optical flow from image I1 to image I0 is the forward optical flow, which records the motion direction and rate from image I1 to image I0.
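To make the role of this flow information concrete, the following is a minimal sketch of warping a tensor (for example, a predicted mask) along a given flow with bilinear sampling in PyTorch; the helper name warp_with_flow and the (N, 2, H, W) pixel-displacement layout are illustrative assumptions, not part of the original disclosure:

    import torch
    import torch.nn.functional as F

    def warp_with_flow(x, flow):
        # x: (N, C, H, W) tensor; flow: (N, 2, H, W) displacements in pixels.
        n, _, h, w = x.shape
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        grid = torch.stack((xs, ys), dim=0).float().to(x.device)  # identity grid (2, H, W)
        coords = grid.unsqueeze(0) + flow                         # displaced positions
        # Normalize coordinates to [-1, 1], as grid_sample expects.
        cx = 2.0 * coords[:, 0] / (w - 1) - 1.0
        cy = 2.0 * coords[:, 1] / (h - 1) - 1.0
        sample_grid = torch.stack((cx, cy), dim=-1)               # (N, H, W, 2)
        return F.grid_sample(x, sample_grid, align_corners=True)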
Moreover, the foregoing is only illustrative: of the two adjacent training images, the image earlier in the time sequence is taken as the first image and the one later in the sequence as the second image. In some other possible embodiments of this application, the later image may instead be taken as the first image and the earlier one as the second image; this application does not limit this.
Optionally, in some embodiments, the training annotation information obtained by the electronic device may be the annotation mask information of the second image. For example, in the scenario shown in FIG. 3, the training annotation information obtained by the electronic device may be the annotation mask information of image I1; correspondingly, the optical flow information obtained by the electronic device may be the backward optical flow with respect to image I1.
On this basis, please refer to FIG. 4, which shows a schematic flowchart of the sub-steps of step 206 in FIG. 2. In some possible implementations, step 206 may include the following sub-steps:
Step 206-1: obtain the content loss of the image segmentation model according to the annotation mask information, the second training mask information, and the second image.
Step 206-2: obtain the timing loss of the image segmentation model according to the optical flow information, the first image, the second image, the first training mask information, and the second training mask information.
Step 206-3: update the model parameters of the image segmentation model based on the content loss and the timing loss.
When performing step 206 to update the model parameters of the image segmentation model, the electronic device may divide the loss function of the image segmentation model into two parts: a content loss and a timing loss.
For example, in some embodiments, the loss function of the image segmentation model may satisfy the following formula:
L = Lc + Lst
where L denotes the total loss of the image segmentation model, Lc denotes the content loss, and Lst denotes the timing loss.
The content loss constrains the second training mask information output by the image segmentation model against the actual mask information of the second image; it guarantees the accuracy of the segmentation result.
Thus, when performing step 206, the electronic device may obtain the content loss of the image segmentation model according to the annotation mask information, the second training mask information, and the second image, that is, compute the difference between the second training mask information and the annotation mask information.
For example, in some possible implementations, the calculation formula for the content loss may satisfy the following:
[Content-loss formula, published as an image (PCTCN2021070167-appb-000001) in the original document.]
where Lc denotes the content loss, mask_gt denotes the annotation mask information, mask1_pre denotes the second training mask information, and I1 denotes the second image.
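The exact expression is published only as an image. As an assumed reconstruction consistent with the variables defined above, offered purely for orientation and not as the formula of record, an L1 difference between the predicted and annotated masks applied to the second image would read:

    % Assumed form only; the formula of record is the image referenced above.
    L_c = \left\| \mathrm{mask1}_{\mathrm{pre}} \odot I_1 - \mathrm{mask}_{\mathrm{gt}} \odot I_1 \right\|_1

where \odot denotes the element-wise product.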
The timing loss, on the other hand, constrains the motion information between the two frames, ensuring that the mask information corresponding to each of the two frames remains consistent in time sequence.
Thus, when performing step 206, the electronic device obtains the timing loss of the image segmentation model according to the optical flow information, the first image, the second image, the first training mask information, and the second training mask information.
For example, in some possible implementations, the calculation formula for the timing loss may satisfy the following:
[Timing-loss formula, published as an image (PCTCN2021070167-appb-000002) in the original document.]
where Lst denotes the timing loss, α denotes a set parameter, [an intermediate term published as an image (PCTCN2021070167-appb-000003) in the original document], I0 denotes the first image, warp01 denotes the optical flow information, and mask0_pre denotes the first training mask information.
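This expression, too, is published only as an image. A common temporal-consistency form that fits the variables above, stated here as an assumption rather than as the published formula, penalizes the difference between the second predicted mask and the first predicted mask warped by the optical flow:

    % Assumed form only; the formula of record is the image referenced above.
    L_{st} = \alpha \left\| \mathrm{mask1}_{\mathrm{pre}} - \mathcal{W}_{01}(\mathrm{mask0}_{\mathrm{pre}}) \right\|_1

where \mathcal{W}_{01}(\cdot) denotes warping by the flow warp01; the image-rendered intermediate term may additionally involve a visibility weight derived from warping I0, which this sketch omits.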
Based on the content loss and timing loss obtained above, the electronic device may sum the two, taking the sum of the content loss and the timing loss as the total loss of the image segmentation model, and then update the model parameters of the image segmentation model using the computed total loss, based for example on the BP algorithm; training is iterated continuously until the image segmentation model reaches the set convergence condition.
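Putting the pieces together, the following is a minimal sketch of one training iteration under the assumed loss forms above, reusing the warp_with_flow helper sketched earlier; model, optimizer, and alpha are placeholders, and the model is assumed to return a mask tensor directly:

    import torch

    def train_step(model, optimizer, i0, i1, mask_gt, flow01, alpha=1.0):
        # Step 204: predict masks for both adjacent frames.
        mask0_pre = model(i0)
        mask1_pre = model(i1)
        # Content loss: predicted vs. annotated mask on the second image (assumed L1 form).
        lc = torch.abs(mask1_pre * i1 - mask_gt * i1).mean()
        # Timing loss: second mask vs. first mask warped by the flow (assumed form).
        lst = alpha * torch.abs(mask1_pre - warp_with_flow(mask0_pre, flow01)).mean()
        # Total loss L = Lc + Lst, followed by a BP update (step 206).
        loss = lc + lst
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()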
It should be noted that the above formulas for calculating the content loss, the timing loss, and the total loss of the image segmentation model are only illustrative; in some other possible embodiments of this application, other formulas may be used to calculate these losses, and this application does not limit this.
In addition, in the above solution provided in this application, the electronic device may first perform step 206-1 to obtain the content loss and then perform step 206-2 to obtain the timing loss, or it may first perform step 206-2 to obtain the timing loss and then perform step 206-1 to obtain the content loss; this application does not limit the execution order of step 206-1 and step 206-2. For example, in another possible implementation, step 206-1 and step 206-2 may also be executed together.
Furthermore, it should be noted that optical flow is the two-dimensional vector field of an image under translation: it uses a two-dimensional image to represent the velocity field of object points moving in three dimensions, reflecting the image changes formed by motion within a tiny time interval, so as to determine the motion direction and motion rate at each image point. Optical flow can thus be configured to provide cues for recovering image motion.
In the process of training the image segmentation model, the electronic device may obtain the optical flow information between the two training images through online extraction, so as to reduce the user's workload during training.
Therefore, on the basis of FIG. 2, please refer to FIG. 5, which shows another schematic flowchart of the image segmentation model training method provided in this application. Before step 202 is performed, the image segmentation model training method may further include the following step:
Step 201: extract the inter-frame optical flow between the two training images to obtain the optical flow information.
In some embodiments, as shown in FIG. 6, the electronic device may use, for example, the SelFlow algorithm to extract the inter-frame optical flow between the two training images to obtain the optical flow information.
For example, in the above example where the backward optical flow with respect to image I1 is taken as the optical flow information, the electronic device may take image I0 and image I1 as input and use the SelFlow algorithm to extract the backward inter-frame optical flow of image I1, thereby obtaining the optical flow information.
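Since SelFlow ships no standard packaged API, the sketch below substitutes the pretrained RAFT estimator from torchvision purely as an illustrative stand-in for the flow-extraction step; the substitution, the helper name, and the frame-ordering convention are assumptions:

    import torch
    from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

    def extract_backward_flow(i0, i1):
        # i0, i1: (N, 3, H, W) float tensors in [0, 1]; H and W should be multiples of 8.
        weights = Raft_Large_Weights.DEFAULT
        model = raft_large(weights=weights).eval()
        i0, i1 = weights.transforms()(i0, i1)  # normalize as RAFT expects
        with torch.no_grad():
            flow_iterates = model(i0, i1)      # flow from I0 to I1: backward flow for I1
        return flow_iterates[-1]               # (N, 2, H, W) pixel displacements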
However, it should be noted that in embodiments that obtain the optical flow information through the online extraction described above, the training time of the image segmentation model is lengthened because the optical flow extraction step must be performed online; moreover, when the image segmentation model is trained iteratively, step 201 needs to be executed repeatedly, so the optical flow extraction may be performed repeatedly on the same pair of training images.
Therefore, in another possible embodiment of this application, the optical flow information may instead be obtained by offline extraction. That is, step 201 may be performed first, and after the optical flow information of the two training images in each group has been obtained, the two training images of each group and the corresponding optical flow information are taken as the input of the image segmentation model, and the training process is then performed. In this case, since step 201 no longer needs to be executed during training, the training time of the image segmentation model can be reduced, and repeated extraction of optical flow information for the same pair of training images can be avoided.
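A sketch of this offline variant, assuming the extract_backward_flow helper above and an illustrative on-disk layout, runs step 201 once over the whole data set before any training begins:

    import numpy as np

    def precompute_flows(pairs, out_dir):
        # pairs: iterable of (i0, i1) frame tensors; executed once, before training.
        for idx, (i0, i1) in enumerate(pairs):
            flow = extract_backward_flow(i0, i1)
            np.save(f"{out_dir}/flow_{idx:06d}.npy", flow.cpu().numpy())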
Moreover, it should be noted that in actual training scenarios there is little open-source data. To obtain enough training images for training the image segmentation model, a portrait matting data set can be captured from, for example, live-streaming scenes, and the mask information of the corresponding live-streaming frames can be extracted as the annotation mask information, thereby increasing the amount of training data.
However, capturing a portrait matting data set in this way still requires manual operation by the user, which increases the user's workload in training the image segmentation model.
To this end, on the basis of FIG. 2, please refer to FIG. 7, which shows yet another schematic flowchart of the image segmentation model training method provided in this application. Before step 202 is performed, the image segmentation model training method may further include the following step:
Step 200: fuse each of two pieces of obtained object information with one piece of background information to generate the two training images.
In some embodiments, the user may extract the object information from two images that are adjacent in time sequence and transmit the two pieces of object information to the electronic device; the electronic device may then fuse each of the two pieces of object information with one piece of background information to generate two training images, that is, one training image set, thereby increasing the amount of training data.
Of course, it can be understood that the above only takes fusing the obtained object information with one piece of background information as an example of one way of generating two training images; when a large number of training images need to be generated, the electronic device may fuse the two pieces of object information with different pieces of background information, thereby generating multiple training image sets, each of which includes two training images.
In addition, as can be seen from the foregoing, in a scene where the foreground information F separated by matting is fused with the background information B, the fused image I can be expressed as: I = mF + (1 - m)B.
In this expression, m denotes the mask information corresponding to the foreground information F. It can be seen that, based on this expression, as long as the mask information m corresponding to the foreground information F can be obtained, the foreground information F can be fused with any background information B.
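A minimal sketch of this fusion, assuming the foreground, background, and mask are float arrays in [0, 1] with matching spatial size (the function name composite is illustrative):

    import numpy as np

    def composite(foreground, background, mask):
        # I = m * F + (1 - m) * B, applied per pixel.
        # foreground, background: (H, W, 3); mask: (H, W) or (H, W, 1).
        if mask.ndim == 2:
            mask = mask[..., None]  # broadcast the mask over the color channels
        return mask * foreground + (1.0 - mask) * background

Generating one training pair from two adjacent foreground frames f0 and f1 with masks m0 and m1 and a shared background b would then be composite(f0, b, m0) and composite(f1, b, m1).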
Thus, building on the image segmentation model training method provided above, an image segmentation model trained to convergence with that method can be configured to perform, for example, the fusion of the foreground information F with the background information B in a live-streaming scene.
Please refer to FIG. 8, which shows a schematic flowchart of the image processing method provided in this application. In some embodiments, the image processing method may include the following steps:
Step 302: receive an image to be processed and a background to be fused.
Step 304: input the image to be processed into the image segmentation model trained to convergence using the image segmentation model training method, to obtain the target mask information corresponding to the image to be processed.
Step 306: process the image to be processed and the background to be fused using the target mask information, to obtain a fused image.
In some embodiments, for example in a live-streaming scene, the electronic device may take each received frame of the live video as the image to be processed and receive a background to be fused; the aim is to replace the background information of each frame of the live video with that background to be fused.
Then, taking one frame of the live video as the image to be processed as an example, the electronic device may input the image to be processed into the image segmentation model trained to convergence using the image segmentation model training method provided in this application, so that the image segmentation model outputs the target mask information Mask_m corresponding to the image to be processed.
Finally, the electronic device may take the obtained target mask information Mask_m as the parameter m in the above fusion formula, and substitute the image to be processed and the background to be fused into that formula to obtain the fused image I; the effect before and after fusion may be as shown in FIG. 9. In this way, after the image segmentation model has used the optical flow information to learn the motion information between images, it can, during image fusion, combine the motion information between each image and its adjacent images to extract the mask information of the corresponding image, thereby ensuring consistency between consecutive images.
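End to end, per-frame processing in a live-streaming scene might then look like the sketch below, reusing the composite helper above; model stands for the converged segmentation model (assumed to return a mask tensor), and the frame source is a placeholder:

    import torch

    def process_stream(model, frames, background):
        # frames: iterable of (H, W, 3) float arrays in [0, 1]; background: same shape.
        model.eval()
        for frame in frames:
            x = torch.from_numpy(frame).permute(2, 0, 1).unsqueeze(0).float()
            with torch.no_grad():
                mask_m = model(x)[0, 0].numpy()         # target mask information Mask_m
            yield composite(frame, background, mask_m)  # I = m*F + (1 - m)*B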
In addition, based on the same inventive concept as the image segmentation model training method provided above, please refer to FIG. 10, which shows a schematic structural block diagram of the image segmentation model training apparatus 400 provided in this application. The image segmentation model training apparatus 400 may include a first processing module 401 and an update module 402. Specifically:
The first processing module 401 may be configured to obtain a training image set and the training annotation information corresponding to the training image set, where the training image set includes two training images that are adjacent in time sequence and the optical flow information between the two training images.
The first processing module 401 may be further configured to input the two training images into the image segmentation model to obtain two pieces of training mask information.
The update module 402 may be configured to update the model parameters of the image segmentation model according to the two pieces of training mask information, the training annotation information, and the optical flow information, until the image segmentation model reaches the set convergence condition.
Optionally, as a possible implementation, the two training images include a first image that is earlier in the time sequence and a second image that is later in the time sequence, and the training annotation information is the annotation mask information of the second image;
the two pieces of training mask information include the first training mask information corresponding to the first image and the second training mask information corresponding to the second image;
in the process of updating the model parameters of the image segmentation model according to the two pieces of training mask information, the training annotation information, and the optical flow information, the update module 402 may be configured to:
obtain the content loss of the image segmentation model according to the annotation mask information, the second training mask information, and the second image;
obtain the timing loss of the image segmentation model according to the optical flow information, the first image, the second image, the first training mask information, and the second training mask information;
update the model parameters of the image segmentation model based on the content loss and the timing loss.
Optionally, as a possible implementation, in the process of updating the model parameters of the image segmentation model based on the content loss and the timing loss, the update module 402 may be configured to:
calculate the sum of the content loss and the timing loss as the total loss of the image segmentation model, so as to update the model parameters of the image segmentation model using the total loss.
Optionally, as a possible implementation, the calculation formula for the content loss satisfies the following:
[Content-loss formula, published as an image (PCTCN2021070167-appb-000004) in the original document.]
where Lc denotes the content loss, mask_gt denotes the annotation mask information, mask1_pre denotes the second training mask information, and I1 denotes the second image;
the calculation formula for the timing loss satisfies the following:
[Timing-loss formula, published as an image (PCTCN2021070167-appb-000005) in the original document.]
where Lst denotes the timing loss, α denotes a set parameter, [an intermediate term published as an image (PCTCN2021070167-appb-000006) in the original document], I0 denotes the first image, warp01 denotes the optical flow information, and mask0_pre denotes the first training mask information.
Optionally, as a possible implementation, the annotation mask information is the mask information of a captured live-streaming frame.
Optionally, as a possible implementation, before obtaining the training image set and the training annotation information corresponding to the training image set, the first processing module 401 may be further configured to:
extract the inter-frame optical flow between the two training images to obtain the optical flow information.
Optionally, as a possible implementation, in the process of extracting the inter-frame optical flow between the two training images to obtain the optical flow information, the first processing module 401 may be configured to:
use the SelFlow algorithm to extract the inter-frame optical flow between the two training images to obtain the optical flow information.
Optionally, as a possible implementation, before obtaining the training image set and the training annotation information corresponding to the training image set, the first processing module 401 may be further configured to:
fuse each of two pieces of obtained object information with one piece of background information to generate the two training images.
Furthermore, based on the same inventive concept as the image processing method provided above, please refer to FIG. 11, which shows a schematic structural block diagram of the image processing apparatus 500 provided in this application. The image processing apparatus 500 may include a receiving module 501 and a second processing module 502. Specifically:
The receiving module 501 may be configured to receive an image to be processed and a background to be fused.
The second processing module 502 may be configured to input the image to be processed into the image segmentation model trained to convergence using the image segmentation model training method provided in this application, to obtain the target mask information corresponding to the image to be processed.
The second processing module 502 may be further configured to process the image to be processed and the background to be fused using the target mask information, to obtain a fused image.
Optionally, as a possible implementation, in the process of receiving the image to be processed, the receiving module 501 may be configured to:
take each received frame of the live video as the image to be processed.
In the embodiments provided in this application, it should be understood that the disclosed apparatus and methods may also be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the drawings show the possible architecture, functionality, and operation of apparatus, methods, and computer program products according to some embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions configured to implement the specified logical function.
It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved.
It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
In addition, the functional modules in some embodiments of this application may be integrated together to form an independent part, each module may exist alone, or two or more modules may be integrated to form an independent part.
If the functions are implemented in the form of software functional modules and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in some embodiments of this application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory, a random access memory, a magnetic disk, or an optical disc.
The foregoing is only a part of the embodiments of this application and is not intended to limit this application. For those skilled in the art, this application may have various modifications and variations. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of this application shall be included within the protection scope of this application.
It is obvious to those skilled in the art that this application is not limited to the details of the above exemplary embodiments, and that this application can be implemented in other specific forms without departing from the spirit or essential characteristics of this application. Therefore, from whatever point of view, the embodiments should be regarded as exemplary and non-limiting. The scope of this application is defined by the appended claims rather than by the above description, and it is therefore intended that all changes falling within the meaning and scope of equivalents of the claims be embraced in this application. Any reference signs in the claims shall not be construed as limiting the claims concerned.
Industrial Applicability
During the training of the image segmentation model, two training images adjacent in time sequence and the optical flow information between them are obtained as the training image set, together with the training annotation information corresponding to the training image set; the two training images are then input into the image segmentation model to obtain two pieces of training mask information; the model parameters of the image segmentation model are then updated according to the two pieces of training mask information, the training annotation information, and the optical flow information, until the image segmentation model reaches the set convergence condition. In this way, the image segmentation model can use the optical flow information to learn the motion information between images, so that it can combine the motion information between each image and its adjacent images to extract the mask information of the corresponding image, thereby ensuring consistency between consecutive images during image fusion.

Claims (15)

  1. An image segmentation model training method, characterized in that the method comprises:
    obtaining a training image set and training annotation information corresponding to the training image set, wherein the training image set comprises two training images that are adjacent in time sequence, and optical flow information between the two training images;
    inputting the two training images into the image segmentation model to obtain two pieces of training mask information;
    updating model parameters of the image segmentation model according to the two pieces of training mask information, the training annotation information, and the optical flow information, until the image segmentation model reaches a set convergence condition.
  2. The method according to claim 1, characterized in that the two training images comprise a first image that is earlier in the time sequence and a second image that is later in the time sequence, and the training annotation information is annotation mask information of the second image;
    the two pieces of training mask information comprise first training mask information corresponding to the first image and second training mask information corresponding to the second image;
    the step of updating the model parameters of the image segmentation model according to the two pieces of training mask information, the training annotation information, and the optical flow information comprises:
    obtaining a content loss of the image segmentation model according to the annotation mask information, the second training mask information, and the second image;
    obtaining a timing loss of the image segmentation model according to the optical flow information, the first image, the second image, the first training mask information, and the second training mask information;
    updating the model parameters of the image segmentation model based on the content loss and the timing loss.
  3. The method according to claim 2, characterized in that updating the model parameters of the image segmentation model based on the content loss and the timing loss comprises:
    calculating the sum of the content loss and the timing loss as the total loss of the image segmentation model, so as to update the model parameters of the image segmentation model using the total loss.
  4. The method according to claim 2 or 3, characterized in that the calculation formula for the content loss satisfies the following:
    [Content-loss formula, published as an image (PCTCN2021070167-appb-100001) in the original document.]
    wherein Lc denotes the content loss, mask_gt denotes the annotation mask information, mask1_pre denotes the second training mask information, and I1 denotes the second image;
    the calculation formula for the timing loss satisfies the following:
    [Timing-loss formula, published as an image (PCTCN2021070167-appb-100002) in the original document.]
    wherein Lst denotes the timing loss, α denotes a set parameter, [an intermediate term published as an image (PCTCN2021070167-appb-100003) in the original document], I0 denotes the first image, warp01 denotes the optical flow information, and mask0_pre denotes the first training mask information.
  5. The method according to any one of claims 2-4, characterized in that the annotation mask information is mask information of a captured live-streaming frame.
  6. The method according to any one of claims 1-5, characterized in that before obtaining the training image set and the training annotation information corresponding to the training image set, the method further comprises:
    extracting inter-frame optical flow between the two training images to obtain the optical flow information.
  7. The method according to claim 6, characterized in that extracting the inter-frame optical flow between the two training images to obtain the optical flow information comprises:
    using the SelFlow algorithm to extract the inter-frame optical flow between the two training images to obtain the optical flow information.
  8. The method according to any one of claims 1-7, characterized in that before obtaining the training image set and the training annotation information corresponding to the training image set, the method further comprises:
    fusing each of two pieces of obtained object information with one piece of background information to generate the two training images.
  9. The method according to any one of claims 1-8, characterized in that the network structure of the image segmentation model is a Unet network, Deeplabv3, or a SEGNET network.
  10. An image processing method, characterized in that the method comprises:
    receiving an image to be processed and a background to be fused;
    inputting the image to be processed into an image segmentation model trained to convergence using the method according to any one of claims 1-9, to obtain target mask information corresponding to the image to be processed;
    processing the image to be processed and the background to be fused using the target mask information, to obtain a fused image.
  11. The method according to claim 10, characterized in that receiving the image to be processed comprises:
    taking each received frame of a live video as the image to be processed.
  12. An image segmentation model training apparatus, characterized in that the apparatus comprises:
    a first processing module configured to obtain a training image set and training annotation information corresponding to the training image set, wherein the training image set comprises two training images that are adjacent in time sequence, and optical flow information between the two training images;
    the first processing module being further configured to input the two training images into the image segmentation model to obtain two pieces of training mask information; and
    an update module configured to update model parameters of the image segmentation model according to the two pieces of training mask information, the training annotation information, and the optical flow information, until the image segmentation model reaches a set convergence condition.
  13. An image processing apparatus, characterized in that the apparatus comprises:
    a receiving module configured to receive an image to be processed and a background to be fused;
    a second processing module configured to input the image to be processed into an image segmentation model trained to convergence using the method according to any one of claims 1-9, to obtain target mask information corresponding to the image to be processed;
    the second processing module being further configured to process the image to be processed and the background to be fused using the target mask information, to obtain a fused image.
  14. An electronic device, characterized by comprising:
    a memory configured to store one or more programs; and
    a processor;
    wherein, when the one or more programs are executed by the processor, the method according to any one of claims 1-11 is implemented.
  15. A computer-readable storage medium on which a computer program is stored, characterized in that, when executed by a processor, the computer program implements the method according to any one of claims 1-11.
PCT/CN2021/070167 2020-01-07 2021-01-04 Image processing method, image segmentation model training method and related apparatus WO2021139625A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010014372.5A CN111260679B (en) 2020-01-07 2020-01-07 Image processing method, image segmentation model training method and related device
CN202010014372.5 2020-01-07

Publications (1)

Publication Number Publication Date
WO2021139625A1 true WO2021139625A1 (en) 2021-07-15

Family

ID=70923869

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/070167 WO2021139625A1 (en) 2020-01-07 2021-01-04 Image processing method, image segmentation model training method and related apparatus

Country Status (2)

Country Link
CN (1) CN111260679B (en)
WO (1) WO2021139625A1 (en)

Families Citing this family (8)

Publication number Priority date Publication date Assignee Title
CN111260679B (en) * 2020-01-07 2022-02-01 广州虎牙科技有限公司 Image processing method, image segmentation model training method and related device
CN112351291A (en) * 2020-09-30 2021-02-09 深圳点猫科技有限公司 Teaching interaction method, device and equipment based on AI portrait segmentation
CN112560583A (en) * 2020-11-26 2021-03-26 复旦大学附属中山医院 Data set generation method and device
CN112669324B (en) * 2020-12-31 2022-09-09 中国科学技术大学 Rapid video target segmentation method based on time sequence feature aggregation and conditional convolution
CN113051430B (en) * 2021-03-26 2024-03-26 北京达佳互联信息技术有限公司 Model training method, device, electronic equipment, medium and product
CN113393465A (en) * 2021-05-26 2021-09-14 浙江吉利控股集团有限公司 Image generation method and device
CN115836319A (en) * 2021-07-15 2023-03-21 京东方科技集团股份有限公司 Image processing method and device
CN114782460B (en) * 2022-06-21 2022-10-18 阿里巴巴达摩院(杭州)科技有限公司 Image segmentation model generation method, image segmentation method and computer equipment

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
US7676081B2 (en) * 2005-06-17 2010-03-09 Microsoft Corporation Image segmentation of foreground from background layers
CN109978893B (en) * 2019-03-26 2023-06-20 腾讯科技(深圳)有限公司 Training method, device, equipment and storage medium of image semantic segmentation network
CN110472593B (en) * 2019-08-20 2021-02-09 重庆紫光华山智安科技有限公司 Training image acquisition method, model training method and related device

Patent Citations (8)

Publication number Priority date Publication date Assignee Title
CN103942794A (en) * 2014-04-16 2014-07-23 南京大学 Image collaborative cutout method based on confidence level
US20170372479A1 (en) * 2016-06-23 2017-12-28 Intel Corporation Segmentation of objects in videos using color and depth information
CN109697689A (en) * 2017-10-23 2019-04-30 北京京东尚科信息技术有限公司 Storage medium, electronic equipment, image synthesizing method and device
CN107808389A (en) * 2017-10-24 2018-03-16 上海交通大学 Unsupervised methods of video segmentation based on deep learning
CN108875900A (en) * 2017-11-02 2018-11-23 北京旷视科技有限公司 Method of video image processing and device, neural network training method, storage medium
CN110060264A (en) * 2019-04-30 2019-07-26 北京市商汤科技开发有限公司 Neural network training method, video frame processing method, apparatus and system
CN110176027A (en) * 2019-05-27 2019-08-27 腾讯科技(深圳)有限公司 Video target tracking method, device, equipment and storage medium
CN111260679A (en) * 2020-01-07 2020-06-09 广州虎牙科技有限公司 Image processing method, image segmentation model training method and related device

Cited By (7)

Publication number Priority date Publication date Assignee Title
CN113610865A (en) * 2021-07-27 2021-11-05 Oppo广东移动通信有限公司 Image processing method, image processing device, electronic equipment and computer readable storage medium
CN113610865B (en) * 2021-07-27 2024-03-29 Oppo广东移动通信有限公司 Image processing method, device, electronic equipment and computer readable storage medium
CN113570689A (en) * 2021-07-28 2021-10-29 杭州网易云音乐科技有限公司 Portrait cartoon method, apparatus, medium and computing device
CN113570689B (en) * 2021-07-28 2024-03-01 杭州网易云音乐科技有限公司 Portrait cartoon method, device, medium and computing equipment
CN115457119A (en) * 2022-09-21 2022-12-09 正泰集团研发中心(上海)有限公司 Bus bar labeling method and device, computer equipment and readable storage medium
CN115457119B (en) * 2022-09-21 2023-10-27 正泰集团研发中心(上海)有限公司 Bus bar labeling method, device, computer equipment and readable storage medium
CN117237397A (en) * 2023-07-13 2023-12-15 天翼爱音乐文化科技有限公司 Portrait segmentation method, system, equipment and storage medium based on feature fusion

Also Published As

Publication number Publication date
CN111260679A (en) 2020-06-09
CN111260679B (en) 2022-02-01

Similar Documents

Publication Publication Date Title
WO2021139625A1 (en) Image processing method, image segmentation model training method and related apparatus
Sindagi et al. Multi-level bottom-top and top-bottom feature fusion for crowd counting
WO2019114405A1 (en) Video recognition and training method and apparatus, electronic device and medium
Shivakumar et al. Dfusenet: Deep fusion of rgb and sparse depth information for image guided dense depth completion
Wang et al. Deep online video stabilization with multi-grid warping transformation learning
US11722727B2 (en) Special effect processing method and apparatus for live broadcasting, and server
CN109859295B (en) Specific cartoon face generation method, terminal device and storage medium
Manen et al. Pathtrack: Fast trajectory annotation with path supervision
CN111783647B (en) Training method of face fusion model, face fusion method, device and equipment
KR20220006657A (en) Remove video background using depth
US9881207B1 (en) Methods and systems for real-time user extraction using deep learning networks
Zhu et al. Cross-modality 3d object detection
WO2022156622A1 (en) Sight correction method and apparatus for face image, device, computer-readable storage medium, and computer program product
WO2022156626A1 (en) Image sight correction method and apparatus, electronic device, computer-readable storage medium, and computer program product
US20210158008A1 (en) UAV Video Aesthetic Quality Evaluation Method Based On Multi-Modal Deep Learning
CN111402399A (en) Face driving and live broadcasting method and device, electronic equipment and storage medium
JP2016110653A (en) Method for dividing and tracking content in video stream
Tang et al. Tafnet: A three-stream adaptive fusion network for rgb-t crowd counting
WO2022120997A1 (en) Distributed slam system and learning method therefor
WO2022218012A1 (en) Feature extraction method and apparatus, device, storage medium, and program product
CN111815595A (en) Image semantic segmentation method, device, equipment and readable storage medium
CN112308770A (en) Portrait conversion model generation method and portrait conversion method
Kim et al. End-to-end lip synchronisation based on pattern classification
CN101945299B (en) Camera-equipment-array based dynamic scene depth restoring method
Shen et al. RGBT tracking based on cooperative low-rank graph model

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21738740; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21738740; Country of ref document: EP; Kind code of ref document: A1)