CN114897916A - Image processing method and device, nonvolatile readable storage medium and electronic equipment - Google Patents

Image processing method and device, nonvolatile readable storage medium and electronic equipment

Info

Publication number
CN114897916A
CN114897916A
Authority
CN
China
Prior art keywords
image data
image
target area
neural network
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210493747.XA
Other languages
Chinese (zh)
Inventor
杨勇杰
林崇仰
王进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Rainbow Software Co ltd
Original Assignee
Rainbow Software Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Rainbow Software Co ltd filed Critical Rainbow Software Co ltd
Priority to CN202210493747.XA priority Critical patent/CN114897916A/en
Publication of CN114897916A publication Critical patent/CN114897916A/en
Priority to PCT/CN2023/092576 priority patent/WO2023217046A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/70Denoising; Smoothing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an image processing method and device, a nonvolatile readable storage medium, and electronic equipment. The method comprises the following steps: acquiring first image data of a target object and second image data of a target scene; segmenting the first image data according to the type of the first image data to obtain a first target area corresponding to the target object in the first image data; and fusing the first target area with the second image data to generate a fused image. The method and device solve the technical problem that the background environment cannot be taken into account because the respective characteristics of the front and rear cameras are not effectively utilized.

Description

Image processing method and device, nonvolatile readable storage medium and electronic equipment
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an image processing method and apparatus, a non-volatile readable storage medium, and an electronic device.
Background
The camera is one of the most important sensors in an intelligent terminal: the front camera is used for self-portraits, while the rear camera captures other people and scenery, and for many people photography, whether for storage or sharing, is an important part of daily life. At present, most cameras and applications in intelligent terminals acquire and process image or video data based on only one of the front and rear cameras; the respective characteristics of the two cameras are not effectively combined to create interesting portrait or video algorithm effects. In addition, when taking a self-portrait, the user's attention is focused on himself or herself, and the background environment cannot be taken into account.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiments of the present application provide an image processing method and device, a nonvolatile readable storage medium, and electronic equipment, so as to at least solve the technical problem that the background environment cannot be taken into account because the respective characteristics of the front camera and the rear camera are not effectively utilized.
According to an aspect of an embodiment of the present application, there is provided an image processing method including: acquiring first image data of a target object and second image data of a target scene; segmenting the first image data according to the type of the first image data to obtain a first target area corresponding to the target object in the first image data; and fusing the first target area and the second image data to generate a fused image.
Optionally, segmenting the first image data according to the type of the first image data to obtain a first target region corresponding to the target object in the first image data, including: when the first image data is a video, segmenting the first image data by using a pre-trained first neural network model to obtain a first target area corresponding to the target object; when the first image data is a picture, the first image data is segmented by using a predetermined second neural network model to obtain a first target area corresponding to the target object.
Optionally, before performing the category determination on the first image data, the method further includes: performing content identification and judgment on the first image data; when the input image data is determined to belong to a scene supported by the segmentation processing, the category determination is performed on the first image data; when the input image data is determined not to belong to a scene supported by the segmentation processing, processing of the first image data is ended.
Optionally, the first neural network model is a lightweight model whose network structure uses cross-layer connections between a backbone network and a decoding network.
Optionally, the training method of the pre-trained first neural network model includes: obtaining a training data set, wherein the training data set comprises first sample image data and a first sample target area, the first sample target area being a target area mask map obtained based on the first sample image data; and training a neural network model based on the training data set to generate the first neural network model, wherein, in the process of training the first neural network model, a consistency constraint is applied to the first neural network model based on inter-frame information.
Optionally, after obtaining the first target region corresponding to the target object by using the pre-trained first neural network model, the method further includes: and converting the first target area into a gray mask image, and smoothing the boundary of the gray mask image.
Optionally, after obtaining the first target region corresponding to the target object by using the pre-trained first neural network model, the method further includes: and acquiring a previous frame segmentation result of the first image data, and performing time domain smoothing filtering on the first target region corresponding to the target by using the previous frame segmentation result.
Optionally, the second neural network model is a convolutional neural network model incorporating dilated (atrous) convolution and an attention mechanism.
Optionally, after obtaining the first target region corresponding to the target object by using a pre-trained second neural network model, the method further includes: and combining the first image data and utilizing a pre-trained third neural network to perform optimization processing on the first target area to obtain a processed first target area.
Optionally, the training method of the pre-trained third neural network model includes: acquiring an image of a solid background; performing matting and pre-labeling on the image of the solid background to obtain a label mask image; and training by taking the image of the solid background and the label mask image as sample data to obtain the third neural network model.
Optionally, before the first target region and the second image data are fused and a fused image is generated, the method further includes: and inputting the first target area into a post-processing module for post-processing.
Optionally, performing fusion processing on the first target region and the second image data to generate a fused image includes: evaluating environmental information of the second image data, and correcting the first target area according to the environmental information to obtain a corrected first target area; determining a second target area corresponding to the first target area in the second image data; and replacing the second target area with the corrected first target area.
According to another aspect of the embodiments of the present application, there is also provided an image processing apparatus, including: an acquisition module, configured to acquire first image data of a target object and second image data of a target scene, wherein the acquisition devices for acquiring the first image data and the second image data are located on the same terminal device; a segmentation module, configured to perform segmentation processing on the first image data to obtain a first target area corresponding to the target object in the first image data; and a fusion module, configured to fuse the first target area with the second image data to generate a fused image.
According to still another aspect of the embodiments of the present application, there is provided a non-volatile storage medium including a stored program, wherein when the program runs, a device in which the non-volatile storage medium is located is controlled to execute the image processing method.
According to yet another aspect of the embodiments of the present application, there is also provided an electronic device, including a memory and a processor; the processor is used for running a program, wherein the program executes the processing method of the image when running.
In the embodiments of the present application, first image data of a target object and second image data of a target scene are acquired; the first image data is segmented according to its type to obtain a first target area corresponding to the target object in the first image data; and the first target area is fused with the second image data to generate a fused image. The first image and the second image are respectively collected by acquisition devices on the same terminal device and then fused, achieving the purpose of fully utilizing the front and rear cameras, thereby achieving the technical effect of taking the background environment into account and solving the technical problem that the background environment cannot be taken into account because the respective characteristics of the front and rear cameras are not effectively utilized.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic diagram of an alternative image processing method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of another alternative image processing method according to an embodiment of the application;
FIG. 3 is a schematic diagram of an alternative image segmentation process flow according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an alternative image processing apparatus according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without creative effort shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In accordance with an embodiment of the present application, a method embodiment for processing an image is provided. It should be noted that the steps illustrated in the flowchart of the figures may be performed in a computer system such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different from the one here.
Fig. 1 shows a method for processing an image according to an embodiment of the present application. As shown in Fig. 1, the method includes the following steps:
step S102, acquiring first image data of a target object and second image data of a target scene;
step S104, segmenting the first image data according to the type of the first image data to obtain a first target area corresponding to a target object in the first image data;
and step S106, fusing the first target area and the second image data to generate a fused image.
Through the above steps, acquisition devices located on the same terminal device can respectively acquire the first image data and the second image data, and the target object contained in the first image data is fused with the second image data. The advantages of the front and rear cameras are thereby fully utilized, the technical effect of taking the background environment into account is achieved, and the technical problem that the background environment cannot be taken into account because the respective characteristics of the front and rear cameras are not effectively utilized is solved.
It should be noted that the above-mentioned image data acquisition devices are the front and rear cameras of the same terminal (for example, a mobile phone or a notebook computer). Since the first image data is the data that contains the target object, the device that acquires it is not fixed to the front camera and may also be the rear camera; likewise, the second image data may be acquired by either the front or the rear camera. The first image data and the second image data may each be pictures or videos; the target object may be a portrait or another object, such as an animal or an article; and the target scene may be the scene in which the target object is located or any virtual scene.
Specifically, taking the case of using the front camera of a mobile phone to capture a portrait image and the rear camera to capture a background video as an example, the image processing steps can be as shown in Fig. 2: the collected portrait image and background video go through different processing flows and are then fused to obtain a fused video. On the one hand, foreground and background information are collected at the same time, so more scene information of the shooting location can be retained and restored; on the other hand, the advantages of the front and rear cameras of the mobile phone are fully utilized, making it convenient to observe both the imaging of the self-portrait and the whole background scene in order to obtain better framing.
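As a minimal illustration of the flow in Fig. 2 (not the claimed implementation itself), the sketch below strings the three steps together in Python with OpenCV. The helper `segment_portrait` is a hypothetical placeholder for the segmentation networks described later and is assumed to return a soft foreground mask in [0, 1].

```python
import cv2
import numpy as np

def fuse_frame(portrait_bgr, background_bgr, segment_portrait):
    """Segment the portrait and composite it onto the background frame.

    segment_portrait: hypothetical callable returning a float32 mask in [0, 1]
    with the same height/width as portrait_bgr.
    """
    mask = segment_portrait(portrait_bgr)                      # (H, W) in [0, 1]
    h, w = background_bgr.shape[:2]
    fg = cv2.resize(portrait_bgr, (w, h)).astype(np.float32)
    alpha = cv2.resize(mask, (w, h))[..., None]                 # (H, W, 1)
    fused = alpha * fg + (1.0 - alpha) * background_bgr.astype(np.float32)
    return fused.astype(np.uint8)

# Usage (assumed camera indices): front-camera selfie frame + rear-camera background frame.
# cap_front, cap_rear = cv2.VideoCapture(1), cv2.VideoCapture(0)
```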
In some embodiments of the present application, segmenting the first image data according to the type of the first image data to obtain a first target region corresponding to the target object in the first image data includes: when the first image data is a video, segmenting the first image data by using a pre-trained first neural network model to obtain a first target area corresponding to the target object; when the first image data is a picture, the first image data is segmented by using a predetermined second neural network model to obtain a first target area corresponding to the target object.
It should be noted that, since the present application does not limit the type of data collected by the terminal, the collected first image data and second image data may be pictures or videos, and in actual segmentation the processing targets and methods differ for different data types; therefore, the type of the first image data is determined before the image data is segmented. Furthermore, for the segmentation of video data, both precision and real-time processing must be ensured, so the first neural network model may, for example, be a lightweight neural network model for segmenting video frames; for the segmentation of picture data, higher requirements are placed on details, so the second neural network model may, for example, be a convolutional neural network model.
In some embodiments of the present application, before determining the category of the first image data, the method further includes: performing content identification and judgment on the first image data; when the input image data is determined to belong to a scene supported by the segmentation processing, the category determination is performed on the first image data; when the input image data is determined not to belong to a scene supported by the segmentation processing, processing of the first image data is ended.
Specifically, taking image data from a mobile device as an example, extracting the target region in a picture is realized by computer vision methods, and the processing flow is as shown in Fig. 3: the image data is input and preprocessed, and whether the preprocessed image belongs to a scene supported by the segmentation engine is then judged; if so, processing continues, and if not, processing ends. Firstly, a scene supported by the segmentation engine requires that the image contain a target object, where the type of the target object is preset by the user and the types of target objects contained in the image are obtained by a detection and recognition algorithm. Secondly, the distance between the target object and the acquisition device must meet a preset condition: if the distance falls outside the preset range, either the acquired target object is incomplete because the distance is too close (for example, only part of the face is captured, so no group photo can be formed), or the details of the acquired target object are severely lost because the distance is too far, which would seriously affect the subsequent fusion; in such cases the acquisition device is not started.
It should be noted that preprocessing the image data means taking the image data of the mobile device as input, converting it into the data format required by the subsequent segmentation network, and adjusting the size, color, angle and so on of the image to obtain usable image data. Judging whether the preprocessed image belongs to a scene supported by the segmentation engine is performed by a scene-recognition convolutional neural network trained in advance on a large amount of labeled supervision data, which can quickly and accurately identify the content of the input image; the subsequent processing flow is started only when the input image is judged to belong to a scene supported by the segmentation engine. The recognition network is a typical classification network consisting of convolutional layers, activation layers and pooling layers, and the number and width of the network layers are strictly limited and optimized in view of the performance requirements of the mobile device, ensuring millisecond-level operation on the device.
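A minimal sketch of this preprocessing and scene-gating step is shown below, assuming a small PyTorch classifier has already been trained; the input size, supported class IDs and threshold are illustrative assumptions, not values from the patent.

```python
import cv2
import numpy as np
import torch

def preprocess(image_bgr, size=224):
    """Resize and normalize a BGR frame into the tensor layout the classifier expects."""
    rgb = cv2.cvtColor(cv2.resize(image_bgr, (size, size)), cv2.COLOR_BGR2RGB)
    x = torch.from_numpy(rgb).float().permute(2, 0, 1) / 255.0
    return x.unsqueeze(0)                                   # (1, 3, H, W)

def scene_supported(image_bgr, scene_classifier, supported_ids=(0,), threshold=0.5):
    """Return True only if the lightweight scene classifier predicts a supported class
    with sufficient confidence; otherwise the segmentation flow is skipped."""
    with torch.no_grad():
        probs = torch.softmax(scene_classifier(preprocess(image_bgr)), dim=1)[0]
    cls = int(probs.argmax())
    return cls in supported_ids and float(probs[cls]) >= threshold
```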
In some embodiments of the present application, the first neural network model is a lightweight model whose network structure uses cross-layer connections between a backbone network and a decoding network. The training method of the pre-trained first neural network model comprises the following steps: obtaining a training data set, wherein the training data set comprises first sample image data and a first sample target area, the first sample target area being a target area mask map obtained based on the first sample image data; and training a neural network model based on the training data set to generate the first neural network model, wherein, in the process of training the first neural network model, a consistency constraint is applied to the first neural network model based on inter-frame information.
Specifically, for the first neural network model, the training data set is first created by collecting a large amount of first sample image data, where each first sample image must contain both the object and the background. Since the training network and the prediction network are independent of each other, the sample image data does not restrict the object and scene to be the target object and target background used at inference time, but the sample image data does need to cover the categories of actual usage scenes. In addition, the present application does not limit how the first sample image data is generated; it can be captured in real scenes or synthesized afterwards. After the first sample image data is obtained, the first sample target area is determined by manual or automatic identification, that is, a target area mask map is obtained by labeling the first sample image data and is used as supervision information. In view of the real-time requirements of video applications, the network adopts a lightweight model consisting of convolutional layers, activation layers, pooling layers and deconvolution layers, and the number of network layers, convolution type, down-sampling positions and so on are optimized in a targeted manner. The network structure uses cross-layer connections between the backbone network and the decoding network, which improves the segmentation precision. Meanwhile, real-time segmentation performance is achieved during deployment by means of instruction optimization and device-specific optimization for the target hardware platform. During training, in order to improve the stability of the results over the video time sequence, an inter-frame information consistency constraint is added.
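The patent does not spell out the exact form of the inter-frame consistency constraint; one plausible realization, sketched below under that assumption, is a supervised segmentation loss on the current frame plus an L1 penalty on the difference between predictions for consecutive frames (motion compensation is omitted for brevity).

```python
import torch
import torch.nn.functional as F

def segmentation_loss(model, frame_t, frame_prev, mask_gt_t, lam=0.1):
    """Supervised loss on the current frame plus an inter-frame consistency term.

    frame_t, frame_prev: (N, 3, H, W) consecutive frames; mask_gt_t: (N, 1, H, W) float labels.
    lam is an illustrative weight for the consistency term.
    """
    logits_t = model(frame_t)
    logits_prev = model(frame_prev)
    sup = F.binary_cross_entropy_with_logits(logits_t, mask_gt_t)
    # Penalize changes between consecutive predictions to stabilize the video output.
    consistency = F.l1_loss(torch.sigmoid(logits_t), torch.sigmoid(logits_prev))
    return sup + lam * consistency
```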
In some embodiments of the present application, after obtaining the first target region corresponding to the target object by using the pre-trained first neural network model, the method further includes: converting the first target area into a gray mask image; and smoothing the boundary of the gray mask image.
Specifically, after the corresponding first target region of the video frame is obtained through the pre-trained first neural network model, the first target region is converted into a gray mask image whose background is 0. Then, small isolated regions are removed by an image processing algorithm and the boundary of the mask is smoothed. Introducing the mask helps shield noise outside the target region and allows the region of interest to be fully extracted and utilized, thereby optimizing the segmentation result.
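A minimal OpenCV sketch of this cleanup step is given below; the area threshold and blur kernel size are illustrative assumptions.

```python
import cv2
import numpy as np

def clean_mask(mask01, min_area=500, blur_ksize=7):
    """Binarize the mask, drop small isolated regions, then smooth the boundary."""
    binary = (mask01 > 0.5).astype(np.uint8)
    num, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    keep = np.zeros_like(binary)
    for i in range(1, num):                                   # label 0 is the background
        if stats[i, cv2.CC_STAT_AREA] >= min_area:
            keep[labels == i] = 1
    gray_mask = (keep * 255).astype(np.uint8)                 # background stays 0
    return cv2.GaussianBlur(gray_mask, (blur_ksize, blur_ksize), 0)
```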
In some embodiments of the present application, after obtaining the first target region corresponding to the target object by using the pre-trained first neural network model, the method further includes: acquiring the previous-frame segmentation result of the first image data, and performing temporal smoothing filtering on the first target region corresponding to the target object using the previous-frame segmentation result. Specifically, using the segmentation result of the previous frame for temporal smoothing filtering increases inter-frame stability and ensures the continuity of the output video result.
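One simple way to realize this temporal filtering (an assumption, since the patent does not specify the filter) is an exponential moving average over the masks; the blending weight below is illustrative.

```python
import numpy as np

def temporal_smooth(mask_t, mask_prev, alpha=0.7):
    """Exponentially blend the current mask with the previous frame's result
    to suppress flicker between frames."""
    if mask_prev is None:
        return mask_t.astype(np.float32)
    return alpha * mask_t.astype(np.float32) + (1.0 - alpha) * mask_prev.astype(np.float32)
```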
In some embodiments of the present application, the second neural network model is a convolutional neural network model incorporating dilated (atrous) convolution and an attention mechanism.
Specifically, the second neural network model is a deep convolutional network with more parameters and a more complex structure. Considering that the photographing mode places higher requirements on details, more precise standards are adopted when the training data are manually annotated, improving the quality of the supervision data. At the same time, because the requirement on computational performance in the photographing mode is lower, structures such as dilated convolution and an attention mechanism are added to the network to improve its analysis capability, and the depth, width and feature-map size of the network are appropriately relaxed to meet the requirement of more accurate segmentation. In the deployment stage, the running speed of the model is improved by means of instruction optimization and device-specific optimization for the target platform.
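As a sketch only, the block below shows one way a dilated convolution can be combined with a squeeze-and-excitation style channel attention gate; the dilation rate and reduction ratio are assumptions, and the patent does not prescribe this exact module.

```python
import torch
import torch.nn as nn

class DilatedSEBlock(nn.Module):
    """Dilated (atrous) convolution followed by a channel-attention gate -
    one possible realization of the structures mentioned above."""

    def __init__(self, channels, dilation=2, reduction=8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        y = self.conv(x)
        return y * self.attention(y)   # re-weight channels, spatial size unchanged
```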
It should be noted that the first target area may be an area where the target object is located in the first image data.
In some embodiments of the present application, after obtaining the first target region corresponding to the target object by using a pre-trained second neural network model, the method further includes: optimizing the first target area, in combination with the first image data, by using a pre-trained third neural network, to obtain a processed first target area.
The training method of the pre-trained third neural network model comprises the following steps: acquiring an image of a solid background; performing matting and pre-labeling on the image of the solid background to obtain a label mask image; and training by taking the image of the solid background and the label mask image as sample data to obtain the third neural network model.
It should be noted that, as shown in fig. 3, the third neural network may be a Matting network (a neural network used for image refinement and segmentation).
Specifically, the inputs to the third neural network are the output of the second neural network model and the original first image data. Due to limitations such as the resolution and down-sampling of the segmentation network, the region map obtained by the second neural network cannot produce fine segmentation results in regions such as object edges and hair. Taking the matting network as an example: on the basis of the first target area output by the second neural network, the matting network derives a trimap (a three-region map of foreground, background and unknown) from the confidence values, and an attention mechanism is added to the network so that it focuses more on the edges, improving edge precision. The output of the network is an alpha matte, that is, the processed first target region map, which is a regression of the opacity at each pixel location.
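The sketch below shows one conventional way to build a trimap from the segmentation confidence map (confident foreground, confident background, and an unknown band around the boundary); the thresholds and band width are illustrative, not values from the patent.

```python
import cv2
import numpy as np

def make_trimap(prob, fg_thresh=0.95, bg_thresh=0.05, unknown_width=10):
    """Build a trimap (0 = background, 128 = unknown, 255 = foreground) from
    a segmentation confidence map, widening the unknown band around the edge."""
    fg = (prob >= fg_thresh).astype(np.uint8)
    bg = (prob <= bg_thresh).astype(np.uint8)
    kernel = np.ones((unknown_width, unknown_width), np.uint8)
    fg = cv2.erode(fg, kernel)                  # shrink confident foreground
    bg = cv2.erode(bg, kernel)                  # shrink confident background
    trimap = np.full(prob.shape, 128, np.uint8)
    trimap[fg == 1] = 255
    trimap[bg == 1] = 0
    return trimap
```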
Because training the third neural network requires finer opacity values as supervision data, the training data is acquired in three steps: fixed-point acquisition, automatic pre-labeling and manual correction. Specifically, fixed-point acquisition refers to setting up a solid-color background acquisition environment with adjustable ambient light to collect relatively natural solid-background data. The solid-background data is then automatically pre-labeled with a matting algorithm and a trained third neural network to obtain an initial alpha matte. Finally, error regions in the initial matte are fine-tuned by manual correction to obtain the final labeling result. In practice, the third neural network used for automatic pre-labeling can be iterated as the data is updated, so the pre-labeling quality keeps improving and the cost of manual labeling gradually decreases.
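For illustration only, a crude stand-in for the automatic pre-labeling of solid-background shots is sketched below: alpha is ramped by color distance from the (assumed known) background color. The patent describes using a matting algorithm plus the trained network; the color-distance thresholds here are arbitrary assumptions.

```python
import numpy as np

def prelabel_alpha(image_bgr, bg_color_bgr, soft_range=(20.0, 80.0)):
    """Rough alpha pre-label for a solid-background shot: pixels close to the
    background color get alpha 0, distant pixels get alpha 1, in between is ramped."""
    dist = np.linalg.norm(image_bgr.astype(np.float32) - np.float32(bg_color_bgr), axis=2)
    lo, hi = soft_range
    return np.clip((dist - lo) / (hi - lo), 0.0, 1.0)
```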
In some embodiments of the present application, before performing fusion processing on the first target region and the second image data to generate a fused image, the method further includes: and inputting the first target area into a post-processing module for post-processing.
It should be noted that when the terminal device is movable, for example a mobile phone that is often handheld when shooting, the captured video often contains a certain amount of jitter. A video anti-shake module, using hardware and software techniques, can reduce or eliminate this jitter and make the captured footage more stable. In the present application, whenever the target image data includes video, an anti-shake module can be added to perform anti-shake processing.
In some alternative implementations, due to the limitations of mobile computing power and memory, a neural-network-based semantic segmentation algorithm can only process images of relatively small resolution, generally no more than 512 x 512 pixels, which is often much smaller than the original high-definition (720P), full-HD (1080P) or even 4K images. The usual approach is to first down-sample the image to the network resolution and then up-sample the result back to the original resolution. However, the directly up-sampled result is often misaligned with the original-resolution image at the boundary, and because of the scaling, details that are lost in the small image yield inaccurate segmentation results after up-sampling. Therefore, based on the segmentation result of the first, second or third neural network, the post-processing module uses an edge-preserving filter algorithm to compute a portrait foreground region that is more accurately aligned with the boundaries of the original-resolution image and richer in detail.
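A minimal sketch of such post-processing, assuming a guided filter as the edge-preserving filter (the patent does not name a specific filter), is shown below; it requires the opencv-contrib-python package for `cv2.ximgproc`, and the radius/epsilon values are illustrative.

```python
import cv2
import numpy as np

def upsample_mask(mask_small, image_full_bgr, radius=8, eps=1e-4):
    """Upsample a low-resolution mask to the original resolution and refine it
    against the full-resolution image with an edge-preserving guided filter."""
    h, w = image_full_bgr.shape[:2]
    coarse = cv2.resize(mask_small.astype(np.float32), (w, h), interpolation=cv2.INTER_LINEAR)
    guide = cv2.cvtColor(image_full_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0
    refined = cv2.ximgproc.guidedFilter(guide, coarse, radius, eps)
    return np.clip(refined, 0.0, 1.0)
```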
In some embodiments of the present application, performing fusion processing on the first target region and the second image data to generate a fused image includes: estimating environmental information of the second image data, correcting the first target area according to the environmental information, and obtaining a corrected first target area; determining a second target area corresponding to the first target area in the second image data; replacing the second target area with the corrected first target area.
The fusion is performed according to the respective characteristics of the foreground and background so that the result is as natural as possible: the natural color of the foreground is preserved, and the tone of the foreground can be adjusted to adapt more naturally to the tone of the background, so that the fused image forms a visually complete whole and matting artifacts are reduced. The present application does not limit the types of objects being fused; image foreground with image background, image foreground with video background, video foreground with video background, and video foreground with image background can all be fused. Illustratively, taking the portrait foreground region as the first target area and the background video as the second image data, the output is a result video obtained by fusing the portrait with the background video.
Specifically, because the positions and image quality of the foreground and background target objects differ, direct replacement in the fusion module would lead to an inconsistent final result. Therefore, the present application first evaluates the environmental information of the background, including the direction, intensity and color of the illumination, and then corrects the foreground accordingly to ensure consistent quality between foreground and background. Secondly, regarding placement, the foreground region is generally placed in the center of the background region; in addition, if any of the top, bottom, left or right boundaries of the foreground image touches the image boundary, the fusion result also preserves that characteristic. In practical applications, the region of the acquired background image to be fused is not always a pure background, and occlusion may exist, for example when non-target people or objects appear in the region.
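As a sketch only, the example below approximates the environment evaluation and correction with a simple LAB mean/standard-deviation color transfer, then composites the corrected foreground into the center of the background; this is an assumed simplification (it ignores illumination direction and occlusion handling) and assumes the foreground fits inside the background frame.

```python
import cv2
import numpy as np

def match_color(fg_bgr, bg_bgr):
    """Shift the foreground's LAB statistics toward the background's so the tones match."""
    fg = cv2.cvtColor(fg_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    bg = cv2.cvtColor(bg_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    for c in range(3):
        f_mean, f_std = fg[..., c].mean(), fg[..., c].std() + 1e-6
        b_mean, b_std = bg[..., c].mean(), bg[..., c].std() + 1e-6
        fg[..., c] = (fg[..., c] - f_mean) * (b_std / f_std) + b_mean
    return cv2.cvtColor(np.clip(fg, 0, 255).astype(np.uint8), cv2.COLOR_LAB2BGR)

def composite_center(fg_bgr, alpha, bg_bgr):
    """Paste the corrected foreground into the center of the background using the alpha matte."""
    fh, fw = fg_bgr.shape[:2]
    bh, bw = bg_bgr.shape[:2]
    y0, x0 = (bh - fh) // 2, (bw - fw) // 2            # assumes fg fits inside bg
    out = bg_bgr.astype(np.float32)
    a = alpha[..., None].astype(np.float32)
    region = out[y0:y0 + fh, x0:x0 + fw]
    out[y0:y0 + fh, x0:x0 + fw] = a * fg_bgr.astype(np.float32) + (1.0 - a) * region
    return out.astype(np.uint8)
```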
In addition, if the objects to be fused include video, the amount of computation involved in fusion is huge. In order to both ensure accuracy and maintain real-time processing, the corresponding processing is simplified: the overall flow is optimized, only the lightweight segmentation network is used without the fine-level network, and each module is simplified accordingly, for example by reducing or even completely removing post-processing and further compressing the lightweight network model.
An embodiment of the present application further provides an image processing apparatus, as shown in fig. 4, including: an obtaining module 40, configured to obtain first image data of a target object and second image data of a target scene; a segmentation module 42, configured to perform segmentation processing on the first image data according to a type of the first image data to obtain a first target region corresponding to the target object in the first image data; and a fusion module 44 for performing fusion processing on the first target region and the second image data to generate a fused image.
The segmentation module 42 includes a judgment submodule and a first processing submodule. The judgment submodule is used for judging the category of the first image data: when the first image data is a video, the first image data is segmented by using a pre-trained first neural network model to obtain the first target area corresponding to the target object; when the first image data is a picture, the first image data is segmented by using a predetermined second neural network model to obtain the first target area corresponding to the target object.
The judgment submodule comprises a judging unit, a first training unit and a converting unit.
The judging unit is configured to perform content identification and judgment on the first image data: when the input image data is determined to belong to a scene supported by the segmentation processing, the category determination is performed on the first image data; when the input image data is determined not to belong to a scene supported by the segmentation processing, processing of the first image data is ended.
The first training unit is configured to obtain a training data set, wherein the training data set comprises first sample image data and a first sample target area, the first sample target area being a target area mask map obtained based on the first sample image data; and to train a neural network model based on the training data set to generate the first neural network model, wherein, in the process of training the first neural network model, a consistency constraint is applied to the first neural network model based on inter-frame information.
The conversion unit is used for converting the first target area into a gray mask image; and smoothing the boundary of the gray mask image.
The first processing submodule is used for optimizing the first target area, in combination with the first image data, by using a pre-trained third neural network to obtain a processed first target area.
The first processing submodule includes: a second training unit;
the second training unit is used for acquiring an image of a solid background; performing matting and pre-labeling on the image of the solid background to obtain a label mask image; and taking the image of the solid background as sample data, and taking the label mask image as the supervision data to train to obtain the third neural network model.
The fusion module 44 includes: a second processing submodule and a generating submodule; and the second processing submodule is used for inputting the first target area into the post-processing module for post-processing.
The generation submodule is used for evaluating the environment information of the second image data, correcting the first target area according to the environment information and obtaining a corrected first target area;
determining a second target area corresponding to the first target area in the second image data;
replacing the second target area with the corrected first target area.
According to still another aspect of the embodiments of the present application, there is provided a non-volatile storage medium including a stored program, wherein when the program runs, a device in which the non-volatile storage medium is located is controlled to execute the image processing method.
The nonvolatile storage medium stores a program for executing the following functions: acquiring a first image of a target object and a second image of a target scene; segmenting the first image data according to the type of the first image data to obtain a first target area corresponding to a target object in the first image; and fusing the first target area and the second image to generate a fused image.
According to yet another aspect of the embodiments of the present application, there is also provided an electronic device, including a memory and a processor; the processor is used for running a program, wherein the program executes the processing method of the image when running.
The processor is used for running a program for executing the following functions: acquiring a first image of a target object and a second image of a target scene; segmenting the first image data according to the type of the first image data to obtain a first target area corresponding to a target object in the first image; and fusing the first target area and the second image to generate a fused image.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, a division of a unit may be a division of a logic function, and an actual implementation may have another division, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or may not be executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method of the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present application and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.

Claims (15)

1. A method of processing an image, comprising:
acquiring first image data of a target object and second image data of a target scene;
segmenting the first image data according to the type of the first image data to obtain a first target area corresponding to the target object in the first image data;
and fusing the first target area and the second image data to generate a fused image.
2. The method according to claim 1, wherein performing segmentation processing on the first image data according to the type of the first image data to obtain a first target region corresponding to the target object in the first image data comprises:
when the first image data is a video, segmenting the first image data by using a pre-trained first neural network model to obtain a first target area corresponding to the target object;
when the first image data is a picture, the first image data is segmented by using a predetermined second neural network model to obtain a first target area corresponding to the target object.
3. The method of claim 1, wherein prior to performing the classification determination on the first image data, the method further comprises:
performing content identification and judgment on the first image data;
performing a category determination on the first image data when the input first image data is determined to belong to a scene supported by a segmentation process;
when the input first image data is determined not to belong to a scene supported by segmentation processing, ending processing of the first image data.
4. The method of claim 2, wherein the first neural network model is a lightweight model that employs a cross-layer connection of a backbone network and a decoding network as a network structure.
5. The method of claim 2, wherein the method of training the pre-trained first neural network model comprises:
obtaining a training data set, wherein the training data set comprises: first sample image data and a first sample target area, wherein the first sample target area is a target area mask map obtained based on the first sample image data;
training a neural network model based on the training data set to generate the first neural network model, wherein in the process of training the first neural network model, consistency constraint is carried out on the first neural network model based on interframe information.
6. The method of claim 2, wherein after obtaining the first target region corresponding to the target object by using the pre-trained first neural network model, the method further comprises:
and converting the first target area into a gray mask image, and smoothing the boundary of the gray mask image.
7. The method of claim 2, wherein after obtaining the first target region corresponding to the target object by using the pre-trained first neural network model, the method further comprises:
and acquiring a previous frame segmentation result of the first image data, and performing time domain smoothing filtering on the first target region corresponding to the target by using the previous frame segmentation result.
8. The method of claim 2, wherein the second neural network model is a convolutional neural network model that incorporates dilated convolution and an attention mechanism.
9. The method of claim 2, wherein after obtaining the first target region corresponding to the target object using a pre-trained second neural network model, the method further comprises:
and combining the first image data and utilizing a pre-trained third neural network to perform optimization processing on the first target area to obtain a processed first target area.
10. The method of claim 9, wherein the method of training the pre-trained third neural network model comprises:
acquiring an image of a solid background;
performing matting and pre-labeling on the image of the solid background to obtain a label mask image;
and training by taking the image of the solid background and the label mask image as sample data to obtain the third neural network model.
11. The method according to claim 1 or 9, wherein before performing the fusion process on the first target region and the second image data to generate a fused image, the method further comprises:
and inputting the first target area into a post-processing module for post-processing.
12. The method of claim 1, wherein fusing the first target region with the second image data to generate a fused image comprises:
evaluating environmental information of the second image data, correcting the first target area according to the environmental information, and obtaining a corrected first target area;
determining a second target area corresponding to the first target area in the second image data;
replacing the second target area with the corrected first target area.
13. An image processing apparatus characterized by comprising:
the system comprises an acquisition module, a processing module and a display module, wherein the acquisition module is used for acquiring first image data of a target object and second image data of a target scene;
the segmentation module is used for segmenting the first image data according to the type of the first image data to obtain a first target area corresponding to the target object in the first image data;
and the fusion module is used for fusing the first target area and the second image data to generate a fused image.
14. A non-volatile storage medium, comprising a stored program, wherein when the program runs, a device in which the non-volatile storage medium is located is controlled to execute the image processing method according to any one of claims 1 to 12.
15. An electronic device comprising a memory and a processor; the processor is configured to execute a program, wherein the program executes a method for processing an image according to any one of claims 1 to 12.
CN202210493747.XA 2022-05-07 2022-05-07 Image processing method and device, nonvolatile readable storage medium and electronic equipment Pending CN114897916A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210493747.XA CN114897916A (en) 2022-05-07 2022-05-07 Image processing method and device, nonvolatile readable storage medium and electronic equipment
PCT/CN2023/092576 WO2023217046A1 (en) 2022-05-07 2023-05-06 Image processing method and apparatus, nonvolatile readable storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210493747.XA CN114897916A (en) 2022-05-07 2022-05-07 Image processing method and device, nonvolatile readable storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN114897916A (en) 2022-08-12

Family

ID=82722608

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210493747.XA Pending CN114897916A (en) 2022-05-07 2022-05-07 Image processing method and device, nonvolatile readable storage medium and electronic equipment

Country Status (2)

Country Link
CN (1) CN114897916A (en)
WO (1) WO2023217046A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115760986A (en) * 2022-11-30 2023-03-07 北京中环高科环境治理有限公司 Image processing method and device based on neural network model
WO2023217046A1 (en) * 2022-05-07 2023-11-16 虹软科技股份有限公司 Image processing method and apparatus, nonvolatile readable storage medium and electronic device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111553923B (en) * 2019-04-01 2024-02-23 上海卫莎网络科技有限公司 Image processing method, electronic equipment and computer readable storage medium
CN112419328B (en) * 2019-08-22 2023-08-04 北京市商汤科技开发有限公司 Image processing method and device, electronic equipment and storage medium
CN111581568B (en) * 2020-03-25 2023-04-18 中山大学 Method for changing background of webpage character
CN111629212B (en) * 2020-04-30 2023-01-20 网宿科技股份有限公司 Method and device for transcoding video
CN114897916A (en) * 2022-05-07 2022-08-12 虹软科技股份有限公司 Image processing method and device, nonvolatile readable storage medium and electronic equipment

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023217046A1 (en) * 2022-05-07 2023-11-16 虹软科技股份有限公司 Image processing method and apparatus, nonvolatile readable storage medium and electronic device
CN115760986A (en) * 2022-11-30 2023-03-07 北京中环高科环境治理有限公司 Image processing method and device based on neural network model
CN115760986B (en) * 2022-11-30 2023-07-25 北京中环高科环境治理有限公司 Image processing method and device based on neural network model

Also Published As

Publication number Publication date
WO2023217046A1 (en) 2023-11-16

Similar Documents

Publication Publication Date Title
Li et al. Low-light image and video enhancement using deep learning: A survey
Liu et al. HoLoCo: Holistic and local contrastive learning network for multi-exposure image fusion
CN111402135B (en) Image processing method, device, electronic equipment and computer readable storage medium
Fang et al. Perceptual quality assessment of smartphone photography
Wan et al. CRRN: Multi-scale guided concurrent reflection removal network
US9692964B2 (en) Modification of post-viewing parameters for digital images using image region or feature information
JP6905602B2 (en) Image lighting methods, devices, electronics and storage media
CN106899781B (en) Image processing method and electronic equipment
US9129381B2 (en) Modification of post-viewing parameters for digital images using image region or feature information
CN114897916A (en) Image processing method and device, nonvolatile readable storage medium and electronic equipment
CN113888437A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN111986129A (en) HDR image generation method and device based on multi-shot image fusion and storage medium
Moriwaki et al. Hybrid loss for learning single-image-based HDR reconstruction
CN110611768B (en) Multiple exposure photographic method and device
CN114862698B (en) Channel-guided real overexposure image correction method and device
WO2022261828A1 (en) Image processing method and apparatus, electronic device, and computer-readable storage medium
CN116012232A (en) Image processing method and device, storage medium and electronic equipment
Zhang et al. Deep motion blur removal using noisy/blurry image pairs
CN112258380A (en) Image processing method, device, equipment and storage medium
Lv et al. Low-light image enhancement via deep Retinex decomposition and bilateral learning
CN110365897B (en) Image correction method and device, electronic equipment and computer readable storage medium
CN113658197B (en) Image processing method, device, electronic equipment and computer readable storage medium
CN113610865B (en) Image processing method, device, electronic equipment and computer readable storage medium
CN113038002B (en) Image processing method and device, electronic equipment and readable storage medium
US11816181B2 (en) Blur classification and blur map estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination