CN117541613A - Picture processing method, device, equipment and storage medium

Picture processing method, device, equipment and storage medium

Info

Publication number: CN117541613A
Application number: CN202311283862.5A
Authority: CN (China)
Prior art keywords: picture, original picture, original, main body, loss
Legal status: Pending
Other languages: Chinese (zh)
Inventor: 吴益欢
Assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd; priority to CN202311283862.5A; publication of CN117541613A

Classifications

    • G06T7/194 Image analysis; segmentation; edge detection involving foreground-background segmentation
    • G06N3/0464 Computing arrangements based on biological models; neural networks; convolutional networks [CNN, ConvNet]
    • G06T7/11 Image analysis; region-based segmentation
    • G06T2207/10004 Indexing scheme for image analysis or enhancement; image acquisition modality; still image; photographic image
    • G06T2207/20084 Indexing scheme for image analysis or enhancement; special algorithmic details; artificial neural networks [ANN]


Abstract

The application discloses a picture processing method, device, equipment and storage medium, relating to the technical field of image processing. The method comprises the following steps: performing region segmentation on an original picture to obtain a segmentation map corresponding to the original picture, the segmentation map comprising a plurality of regions; generating a mask corresponding to the original picture based on at least one main body region determined from the plurality of regions; extracting the main body region from the original picture according to the original picture and the mask to obtain a main body picture corresponding to the original picture; and filling the main body region in the original picture according to the original picture, the mask and the content description text corresponding to the original picture, to generate a background picture corresponding to the original picture. The method can be applied to various scenes such as cloud technology, artificial intelligence, intelligent traffic and assisted driving. It automates the separation of the main body picture from the background picture, reduces the steps and difficulty of user operation, and remarkably improves the efficiency of separating the main body picture from the background picture.

Description

Picture processing method, device, equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of image processing, in particular to a method, a device, equipment and a storage medium for processing pictures.
Background
At present, there is a need to separate the main body region of an original picture from its background, i.e., to divide the original picture into a main body picture and a background picture.
In the related art, the user must manually operate a matting tool in image processing software to segment the main body region from the background of the original picture, obtaining a main body picture and a background picture from which the main body region has been cut out. The main body picture contains the main body region of the original picture. A repair tool is then used to fill in content at the position of the removed main body region in the background picture, yielding the final background picture.
However, obtaining the main body picture and the background picture corresponding to the original picture through segmentation in this way is difficult to operate and inefficient.
Disclosure of Invention
The embodiment of the application provides a picture processing method, device, equipment and storage medium. The technical scheme provided by the embodiment of the application is as follows:
according to an aspect of an embodiment of the present application, there is provided a picture processing method, including:
performing region segmentation on an original picture to obtain a segmentation map corresponding to the original picture, wherein the segmentation map comprises a plurality of regions;
generating a mask corresponding to the original picture based on at least one main body region determined from the plurality of regions, wherein the mask is used for distinguishing the main body region from other regions except the main body region in the original picture;
extracting the main body region in the original picture according to the original picture and the mask to obtain a main body picture corresponding to the original picture;
and filling the main body area in the original picture according to the content description text corresponding to the original picture, and generating a background picture corresponding to the original picture.
According to an aspect of an embodiment of the present application, there is provided a picture processing apparatus, including:
a region segmentation module, configured to perform region segmentation on an original picture to obtain a segmentation map corresponding to the original picture, wherein the segmentation map comprises a plurality of regions;
a mask generation module, configured to generate a mask corresponding to the original picture based on at least one main body region determined from the plurality of regions, where the mask is used to distinguish the main body region from other regions except the main body region in the original picture;
a main body extraction module, configured to extract the main body region in the original picture according to the original picture and the mask, to obtain a main body picture corresponding to the original picture;
and a content filling module, configured to fill the main body region in the original picture according to the content description text corresponding to the original picture, to generate a background picture corresponding to the original picture.
According to an aspect of the embodiments of the present application, there is provided a computer device including a processor and a memory, the memory having stored therein a computer program that is loaded and executed by the processor to implement the above-described method.
According to an aspect of the embodiments of the present application, there is provided a computer-readable storage medium having stored therein a computer program that is loaded and executed by a processor to implement the above-described method.
According to one aspect of embodiments of the present application, there is provided a computer program product comprising a computer program stored in a computer readable storage medium. The processor of the computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program so that the computer device performs the above-described method.
The technical scheme provided by the embodiment of the application at least comprises the following beneficial effects:
the original picture is subjected to region segmentation to obtain a segmentation map comprising a plurality of regions; at least one main body region is determined from the segmentation map and a corresponding mask is generated; the main body picture is extracted from the original picture according to the mask; and the main body region in the original picture is then filled according to the content description text corresponding to the original picture to obtain the background picture corresponding to the original picture. This automates the separation of the main body picture and the background picture, reduces the steps and difficulty of user operation, and remarkably improves the efficiency of separating the main body picture from the background picture. Moreover, because the main body region in the original picture is filled based on the content description text corresponding to the original picture, the generated filling content is more accurate, and the quality of the background picture is improved.
Drawings
FIG. 1 is a schematic diagram of an implementation environment for an embodiment provided herein;
FIG. 2 is a flow chart of a picture processing method provided in one embodiment of the present application;
FIG. 3 is a schematic diagram of an original picture provided in one embodiment of the present application;
FIG. 4 is a schematic diagram of a segmentation map corresponding to an original picture according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a mask corresponding to an original picture according to one embodiment of the present application;
FIG. 6 is a schematic diagram of a subject picture corresponding to an original picture according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a background picture corresponding to an original picture according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a picture layering interface of a picture processing tool provided in one embodiment of the present application;
FIG. 9 is a program flow diagram of a picture processing method provided in one embodiment of the present application;
FIG. 10 is a block diagram of a picture processing apparatus provided in one embodiment of the present application;
FIG. 11 is a block diagram of a computer device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Artificial intelligence (Artificial Intelligence, AI for short) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, pre-training models, operation/interaction systems, and mechatronics. A pre-training model, also called a large model or a foundation model, can be widely applied to downstream tasks in all major directions of artificial intelligence after fine-tuning. Artificial intelligence software technology mainly includes directions such as computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV) is a science that studies how to make machines "see"; more specifically, it replaces human eyes with cameras and computers to identify and measure objects and perform other machine vision tasks, and further performs graphic processing so that the picture becomes more suitable for human eyes to observe or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Large-model technology has brought important innovation to the development of computer vision; pre-trained models in the vision field such as Swin-Transformer, ViT, V-MoE and MAE can be quickly and widely applied to specific downstream tasks through fine-tuning. Computer vision technologies typically include image processing, image recognition, image semantic understanding, image retrieval, optical character recognition (Optical Character Recognition, abbreviated as OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, three-dimensional (3D) technology, virtual reality, augmented reality, simultaneous localization and mapping, and the like.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction. Pre-training models are the latest development of deep learning and integrate these techniques.
With the research and progress of artificial intelligence technology, it has been researched and applied in many fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, unmanned aerial vehicles, digital twins, virtual humans, robots, artificial-intelligence-generated content (Artificial Intelligence Generated Content, abbreviated as AIGC), conversational interaction, smart medical care, smart customer service, game AI, and so on. It is believed that with the development of technology, artificial intelligence technology will be applied in more fields and show increasingly important value.
The scheme provided by the embodiment of the application relates to the technologies of computer vision, machine learning and the like of artificial intelligence, and is specifically described through the following embodiments.
Referring to fig. 1, a schematic diagram of an implementation environment of an embodiment of the present application is shown. The implementation environment of the scheme can comprise: computer device 10.
In the method provided by the embodiment of the present application, the execution subject of each step may be a computer device 10, where the computer device 10 is an electronic device with data computing, processing and storage functions. The computer device 10 may be a terminal device or a server.
Exemplary terminal devices include, but are not limited to, mobile phones, computers, intelligent voice interaction devices, smart appliances, vehicle-mounted terminals, aircraft, game consoles, wearable devices, multimedia playback devices, augmented reality (AR) devices, virtual reality (VR) devices, and other electronic devices.
The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (Content Delivery Network, abbreviated as CDN), basic cloud computing services such as big data and an artificial intelligence platform, but is not limited thereto.
As shown in fig. 1, by executing the method provided in the embodiment of the present application, the computer device 10 can automatically segment the main area and the background of the original picture 11, so as to obtain a main image 12 and a background image 13 corresponding to the original picture 11. The subject picture 12 includes a subject region of the original picture 11, which may be automatically identified by the computer device 10 or may be a region manually specified by a user, which is not limited in this application. The background picture 13 includes a portion of the original picture 11 from which the main area is removed, and a picture content obtained by content-filling at the position of the main area.
The embodiments of the present application may be applied to various scenarios including, but not limited to, cloud technology, artificial intelligence, intelligent transportation, assisted driving, and the like.
In the cloud technology application scenario, for example, the segmentation of the main body region and the background may be performed on an original picture taken or generated by the user in the game application, the social application or any other application program, so as to obtain a main body picture and a background picture corresponding to the original picture. Taking a game application scene as an example, taking a picture of a current scene picture as an original picture in a game, and separating a game role in the original picture from a background to obtain a main body picture comprising the game role and a background picture without the game role.
In an artificial intelligence application scene, for example, the segmentation of a main body region and a background can be performed on original pictures in different application scenes such as a character picture, an environment picture, a medical picture and the like, so as to obtain a main body picture and a background picture corresponding to the original pictures. Taking the figure picture as an example, separating the figure in the figure picture from the background to obtain a main body picture comprising the figure and a background picture without the figure.
In an application scenario of intelligent traffic or driving assistance, an environment picture acquired by a camera installed on a vehicle or a road side may be taken as an original picture, and a main body region and a background are segmented to obtain a main body picture and a background picture corresponding to the original picture. For example, the vehicle in the photographed environmental picture is separated from the background, so that a main body picture containing the vehicle and a background picture without the vehicle are obtained, and at this time, a driver can clearly see a complete road without the vehicle shielding from the background picture.
Of course, the above application scenario and the examples of separating the main body from the background are merely exemplary and explanatory, and the technical solution of the present application may be applied to any application scenario with the main body separated from the background, which is not limited in this application.
Referring to fig. 2, a flowchart of a picture processing method according to an embodiment of the present application is shown. The method may comprise at least one of the following steps (210-240):
step 210, performing region segmentation on the original picture to obtain a segmentation map corresponding to the original picture, where the segmentation map includes a plurality of regions.
An original picture is a picture made up of pixels, each of which contains part of the information of the picture. The original picture may take a variety of formats and representations, including but not limited to: Joint Photographic Experts Group (JPEG), Portable Network Graphics (PNG), bitmap (BMP), Tagged Image File Format (TIFF), RAW, Scalable Vector Graphics (SVG), and the like. Each pixel uses numerical values to represent attributes such as brightness, color and position in the picture. For example, in a color picture the three channels Red, Green and Blue (RGB) are used to represent color, and the color in the picture can be precisely specified and adjusted by controlling the intensity values of the three channels; in a black-and-white or grayscale picture, a single channel is used to represent the brightness or gray level. The resolution of a picture refers to the number of pixels of the picture in each direction; the higher the resolution, the sharper the picture.
In some embodiments, the original picture may be uploaded by the user, or may be a picture in a picture library. The picture library is used for storing material pictures, and can be uploaded by related technicians or pictures uploaded by users in history.
Region division refers to dividing a picture into a plurality of regions. In the divided image corresponding to the original image, a plurality of areas included in the original image are recorded. The region is composed of pixels belonging to the same class. In some embodiments, the original picture is segmented based on its content, and each region corresponds to one content and different regions may correspond to different contents in the resulting segmented picture.
In some embodiments, the size of the segmentation map corresponding to the original picture is the same as that of the original picture, and in the segmentation map, each pixel belonging to the same region has the same display pattern (e.g., pixel value), and pixels belonging to different regions have different display patterns (e.g., pixel values).
In some embodiments, the regions in the segmentation map are identified using a display style; the display style is used for distinguishing different areas in the segmentation map. The display patterns may be color, number, or other information capable of distinguishing different areas, and are not limited in this application.
In some embodiments, the following manner is adopted to obtain a segmentation map corresponding to the original picture: extracting picture characteristics of an original picture; determining a category corresponding to each pixel in the original picture according to the picture characteristics of the original picture; obtaining a segmentation map corresponding to the original picture according to the category corresponding to each pixel in the original picture; wherein each pixel belonging to the same class corresponds to one region in the segmentation map, and pixels in different regions correspond to different classes.
In some embodiments, the above-mentioned process of generating the segmentation map corresponding to the original picture is implemented by an image segmentation model. In some embodiments, the image segmentation model includes a convolutional neural network (Convolutional Neural Network, CNN for short) and a full convolutional neural network (Fully Convolutional Neural Network, FCN for short). CNN is a class of deep learning models mainly used for image recognition and image processing. FCN is a special type of CNN, mainly used for image segmentation tasks.
In some embodiments, the picture features of the original picture are extracted by a convolutional neural network. The method includes the steps of inputting an original picture into a CNN to obtain a low-resolution feature map corresponding to the original picture, wherein the feature map contains picture features extracted from the original picture. The feature map is a multi-dimensional array for storing features of pictures, including advanced feature representations extracted from the original pictures. One pixel in the feature map corresponds to one local area of the original picture, and the spatial structure of the original picture is reserved.
The picture features are information extracted from the original picture through the CNN, and can be stored in a feature map, and the picture features can be of different levels and abstraction levels, including but not limited to the following features: edge features, texture features, color features, shape features, semantic features, local features, global features, depth features, contextual features, multi-scale features, and the like.
In some embodiments, the FCN assigns a class to each pixel of the feature map, and the resolution of the feature map is then increased by an upsampling operation, so as to obtain the segmentation map corresponding to the original picture. The upsampling operation of the FCN converts the low-resolution feature map into a higher-resolution one, restoring the segmentation map to the size of the original picture.
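The following sketch illustrates this CNN-plus-FCN pipeline in PyTorch: a small encoder extracts a low-resolution feature map, a 1x1 convolution assigns per-pixel class scores, and bilinear upsampling restores the original resolution. The layer sizes and class count are illustrative assumptions, not the concrete model used in this scheme.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySegmenter(nn.Module):
    """Illustrative CNN encoder + FCN-style per-pixel classifier."""
    def __init__(self, num_classes=5):
        super().__init__()
        # CNN backbone: downsamples the picture and extracts features.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # FCN head: 1x1 convolution assigns class scores to every feature-map pixel.
        self.classifier = nn.Conv2d(64, num_classes, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = self.encoder(x)              # low-resolution feature map
        logits = self.classifier(feats)      # per-pixel class scores
        # Upsampling restores the segmentation map to the original picture size.
        logits = F.interpolate(logits, size=(h, w), mode="bilinear", align_corners=False)
        return logits.argmax(dim=1)          # segmentation map: one class id per pixel

seg_map = TinySegmenter()(torch.rand(1, 3, 256, 256))  # shape (1, 256, 256)
```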
In some embodiments, the categories for classification are different for different application scenarios.
Illustratively, in a game application scenario, the categories for categorizing include: characters, buildings, mountains, etc.; in animal pictures, categories for classification include: rabbits, lions, tigers, penguins, etc.
Illustratively, the feature map output by the CNN is input into the FCN, the FCN performs an up-sampling operation on the feature map, and assigns a class to each pixel of the feature map, and outputs a segmentation map, where each pixel value in the segmentation map represents its class.
As shown in fig. 3, a schematic diagram of an original picture 30 provided in an embodiment of the present application is illustrated, the original picture 30 is input into a CNN to obtain a feature map, the feature map is input into the FCN, and different areas in the segmentation map are identified by using different colors to obtain the segmentation map. As shown in fig. 4, which is a schematic diagram of a segmentation map 40 corresponding to an original picture provided in an embodiment of the present application, five regions, that is, a region 41, a region 42, a region 43, a region 44, and a region 45, respectively, represent five categories, which can be clearly seen from the figure.
By the method, the categories of the objects in the original picture are identified and distinguished in a certain mode, and preparation is made for determining the main body area and generating the mask.
And 220, generating a mask corresponding to the original picture based on at least one main body area determined from the plurality of areas, wherein the mask is used for distinguishing the main body area from other areas except the main body area in the original picture.
The main area refers to an area that needs to be separated from the background in the original picture, and includes the main content in the original picture. In different application scenarios, the content included in the body region is different. For example, an area where a person in the person picture is located may be taken as a main area, an area where a vehicle in the environment picture is located may be taken as a main area, and the like. The number of the main body regions may be one or plural (i.e., at least two), and is not limited in this application.
In some embodiments, the main body region may be manually specified, or may be a default category region. The default category is the category that needs to be automatically identified by the relevant technician. The default class may be one or more of the classes corresponding to the pixels in the above-described segmentation map.
Optionally, when the main area is not identified, the user is requested to reselect the original picture, or the original picture is automatically replaced, or the process is stopped, which is not limited in this application.
Illustratively, assuming that the subject area is manually specified by the user, the user specifies one or more areas from among the plurality of areas of the above-described division map as the subject area. As shown in fig. 4, the user designates the region 42 as a main body region from the five regions of the above-described divided map, and the main body region is the region 42 at this time. Alternatively, the user designates the region 41 and the region 42 as the main body region from the five regions of the above-described divided map, and the main body region includes the region 41 and the region 42 at this time.
For example, if the subject region is a region of a default type, the region corresponding to the default type in the segmentation map is determined as the subject region after the segmentation map is obtained. The default category may be one category or a plurality of categories. For example, if the default category is set as a person, the main body area is an area corresponding to the person, and as shown in fig. 4, the main body area is an area 41.
In some embodiments, the number of body regions may or may not be fixed. When the number of the main body regions is fixed, the number of the regions may be determined as the main body regions by a certain rule from the regions conforming to the category requirements of the main body regions, for example, selected in accordance with the area size of the regions. For example, assuming that the main body area is a person area, the number of the main body areas is a fixed value of 1, when there is only one person area in the original picture, the person area is taken as the main body area; when a plurality of character areas are included in the original picture, the character area having the largest area may be selected as the main body area. When the number of body regions is not fixed, the number of body regions is determined by the content of the original picture. For example, assuming that the main body area is a person area, the number of the main body areas is determined by the content of the original picture, when there is only one person area in the original picture, the person area is taken as the main body area, and at this time, the number of the main body areas is only 1; when the original picture includes a plurality of character areas, the plurality of character areas are respectively used as main body areas, and a plurality of main body areas exist at the moment.
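For the fixed-number case described above, one way to pick the largest region of the target category is sketched below; the use of connected-component labelling and the category id are illustrative assumptions rather than the concrete rule claimed here.

```python
import numpy as np
from scipy import ndimage

def pick_largest_region(seg_map: np.ndarray, person_class: int = 1) -> np.ndarray:
    """Return a boolean mask of the largest connected region of the given class."""
    candidate = seg_map == person_class           # all pixels of the target category
    labeled, count = ndimage.label(candidate)     # split them into connected regions
    if count == 0:
        return np.zeros_like(candidate)           # no main body region found
    areas = ndimage.sum(candidate, labeled, index=range(1, count + 1))
    largest = int(np.argmax(areas)) + 1           # label of the largest region
    return labeled == largest
```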
In some embodiments, when the number of the main body areas is greater than or equal to 2, the plurality of main body areas may be areas of the same category, or may be areas of different categories, which is not limited in this application. For example, a plurality of character areas in the original picture may be used as the main body area, or a character area and a carrier area in the original picture may be used as the main body area.
In some embodiments, setting a pixel value of a main body area to a first value, setting a pixel value of other areas except the main body area to a second value, and generating a mask corresponding to an original picture; wherein the first value and the second value are different.
A mask is a picture or matrix having the same size as the original picture, in which each pixel or element marks a specific area of the original picture, i.e., the main body region or the other regions except the main body region. Pixel values in a mask may be represented in a variety of ways, including but not limited to binary, multi-class, probabilistic and floating-point representations. Illustratively, the pixel values in the mask are set in a binary manner, with 1 indicating that a pixel belongs to the specified region, i.e., the main body region, and 0 indicating that it does not. As shown in fig. 5, which is a schematic diagram of a mask 50 corresponding to an original picture provided in an embodiment of the present application, the region 51 is the main body region, whose pixel value is 1, and the region 52 comprises the other regions except the main body region, whose pixel value is 0.
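Assuming the main body region is available as a boolean array of the same size as the original picture, the binary mask described above can be built as in the following minimal sketch:

```python
import numpy as np

def make_mask(subject_region: np.ndarray) -> np.ndarray:
    """Binary mask: 1 inside the main body region, 0 everywhere else."""
    mask = np.zeros(subject_region.shape, dtype=np.uint8)
    mask[subject_region] = 1   # first value marks the main body region
    return mask                # second value (0) is left for the other regions
```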
In this way, the main region is distinguished from other regions, so that only two regions remain in the picture, and preparation is made for extracting the main region next.
Step 230, extracting the main body region in the original picture according to the original picture and the mask, and obtaining the main body picture corresponding to the original picture.
A main body picture is a picture that retains only the main body region of the original picture.
In some embodiments, each pixel outside the main body region in the corresponding mask in the original picture is set to be transparent, so as to obtain the main body picture corresponding to the original picture.
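A minimal sketch of this transparency step, assuming the original picture is an RGB image and the mask a same-sized 0/1 array, might look like this with Pillow and NumPy:

```python
import numpy as np
from PIL import Image

def extract_subject(original: Image.Image, mask: np.ndarray) -> Image.Image:
    """Keep only the main body region; every other pixel becomes fully transparent."""
    rgba = np.array(original.convert("RGBA"))
    rgba[..., 3] = mask.astype(np.uint8) * 255   # alpha 255 inside the main body, 0 outside
    return Image.fromarray(rgba)
```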
Illustratively, each pixel in the region 52 (e.g., fig. 5) corresponding to the mask 50 (e.g., fig. 5) in the original picture 30 (e.g., fig. 3) is set to be transparent, so as to obtain a main body picture 60 corresponding to the original picture 30, as shown in fig. 6, which illustrates a schematic diagram of the main body picture 60 corresponding to the original picture 30 provided in an embodiment of the present application, and the element 61 is a main body region.
By the method, the main body picture is extracted from the original picture, and main body content in the original picture can be rapidly acquired for a user.
And 240, filling the main body area in the original picture according to the content description text corresponding to the original picture, and generating a background picture corresponding to the original picture.
The content description text is descriptive text about the content of the original picture.
In some embodiments, a picture content recognition model is adopted to recognize the content of the original picture, so as to obtain a content description text corresponding to the original picture. The picture content recognition model is an AI model for recognizing picture content, and its input may be an original picture, and output as a content description text corresponding to the original picture. For example, the picture content recognition model may employ a contrast language image pre-training (Contrastive Language-Image Pretraining, CLIP for short) model, which is a deep learning model with the ability to understand natural language descriptions and picture content.
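One possible way to obtain such a content description text, consistent with the CLIP-based recognition described here, is the open-source clip-interrogator package; the sketch below assumes its Config/Interrogator interface and the ViT-B-32/openai backbone mentioned later in this document, and is not necessarily the configuration actually used.

```python
from PIL import Image
from clip_interrogator import Config, Interrogator  # assumed third-party API

# Assumption: ViT-B-32/openai names the CLIP backbone used for picture content recognition.
ci = Interrogator(Config(clip_model_name="ViT-B-32/openai"))

original = Image.open("original.png").convert("RGB")
content_description_text = ci.interrogate(original)  # e.g. "a man standing on a beach ..."
```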
In some embodiments, the main body region in the original picture is filled to generate the background picture corresponding to the original picture. For example, the main body region in the original picture is filled using a generative adversarial network; after multiple rounds of training, the generative adversarial network can generate, in the main body region of the original picture, filling content similar to its surroundings, so as to obtain the background picture corresponding to the original picture. A generative adversarial network (Generative Adversarial Network, abbreviated as GAN) is a deep learning model for generating synthetic data. The types of synthetic data are numerous and include, but are not limited to, pictures, text, music, models, and the like.
In some embodiments, the main body region in the original picture is filled by the generative adversarial network according to the content description text corresponding to the original picture, so as to generate the background picture corresponding to the original picture.
By the method, the content description text corresponding to the original picture is acquired, so that the filling content similar to the surrounding is generated in the main body area, the quality of the generated background picture is improved, and the naturalness and the authenticity of the generated background picture are improved.
Illustratively, the original picture 30 (as shown in fig. 3) and the content description text corresponding to the original picture are input into the GAN, after multiple rounds of training, the GAN generates filling content similar to the surrounding on the main area of the original picture, and the main area part of the filled original picture is connected with the whole picture naturally, so as to obtain a background picture corresponding to the original picture, which is shown in fig. 7, and is a schematic diagram of the background picture 70 corresponding to the original picture 30 provided in an embodiment of the present application.
In this way, filling content similar to its surroundings is generated in the main body region of the original picture, which reduces the visual abruptness left in the background picture by cutting out the main body region, so that a consistent and natural-looking background picture can be obtained.
Optionally, post-processing the background picture is further included after step 240 to obtain a processed background picture. The post-processing is used for further improving the quality of the background picture.
In some embodiments, post-processing includes, but is not limited to, at least one of: color correction, flaw removal and edge repair. Color correction is a technique for adjusting the color of a picture to improve its visual quality, mainly to solve the problems of repairing or adjusting color deviations, white balance problems or other color-related inconsistencies in the picture. Removal of flaws is a technique for eliminating flaws, noise or defects in pictures, and may include, but is not limited to, the following flaws: picture noise, blemishes or blobs, glitches or artifacts, etc. Edge restoration is a technique for restoring damaged or missing edge information in a picture. Edges of a picture often contain important structures and features of the picture, so that edge repair techniques can be used to reconstruct or enhance edges when the picture is damaged or lacks edge information.
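A lightweight example of such post-processing with OpenCV is sketched below; the specific filters chosen (non-local-means denoising for flaw removal and an edge-preserving bilateral filter as a stand-in for edge repair) are illustrative assumptions rather than the post-processing claimed here.

```python
import cv2
import numpy as np

def post_process(background_bgr: np.ndarray) -> np.ndarray:
    """Illustrative post-processing: denoise, then smooth while preserving edges."""
    # Flaw removal: suppress noise and small artifacts left by the filling step.
    cleaned = cv2.fastNlMeansDenoisingColored(background_bgr, None, 5, 5, 7, 21)
    # Edge-preserving smoothing so the filled region blends with its surroundings.
    return cv2.bilateralFilter(cleaned, d=9, sigmaColor=40, sigmaSpace=40)
```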
In this way, the filled region in the background picture blends more naturally with its surroundings, further improving the quality of the background picture.
In summary, in the technical scheme provided by the embodiments of the application, the original picture is subjected to region segmentation to obtain a segmentation map comprising a plurality of regions; a main body region is determined from the segmentation map and a corresponding mask is generated; the main body picture is extracted from the original picture according to the mask; and the main body region in the original picture is then filled according to the content description text corresponding to the original picture to obtain the background picture corresponding to the original picture. This automates the separation of the main body picture and the background picture, reduces the steps and difficulty of user operation, and remarkably improves the efficiency of separating the main body picture from the background picture. Moreover, because the main body region in the original picture is filled based on the content description text corresponding to the original picture, the generated filling content is more accurate, and the quality of the background picture is improved.
Next, the process of generating the background picture corresponding to the original picture by the generative adversarial network is described.
In some embodiments, the generative adversarial network includes a generator and a discriminator. The generator fills the main body region in the original picture according to the content description text corresponding to the original picture, to generate a candidate background picture corresponding to the original picture. The discriminator discriminates between the candidate background picture and the original picture to obtain a discrimination result. A total loss is determined from a pixel-level loss, a perceptual loss and an adversarial loss; the pixel-level loss is used to measure pixel differences between the candidate background picture and the original picture, the perceptual loss is used to measure differences between the feature representation of the candidate background picture and the feature representation of the original picture, and the adversarial loss is used to comprehensively measure the performance of the generator and of the discriminator. If the generative adversarial network does not meet the training stopping condition, the parameters of the generator and the discriminator are adjusted according to the total loss, and the step of filling the main body region in the original picture through the generator according to the content description text corresponding to the original picture, to generate a candidate background picture corresponding to the original picture, is executed again. If the generative adversarial network meets the training stopping condition, the candidate background picture generated last is determined as the background picture.
The generator is part of the generative adversarial network and is mainly used to generate data similar to the training data. In the present application, the generator generates the candidate background picture corresponding to the original picture according to the content description text corresponding to the original picture; here the training data corresponds to the original picture, and the data generated by the generator is the candidate background picture corresponding to the original picture.
The discriminator is also part of the generative adversarial network and is mainly used to evaluate whether the input data is real data. In the present application, the discriminator evaluates the probability that the candidate background picture is the original picture, and the data input to the discriminator is the candidate background picture.
In some embodiments, the process of generating candidate background pictures by the generator is as follows: extracting picture characteristics of an original picture; fusing the picture characteristics of the original picture with the first picture characteristics to obtain fusion characteristics; the first picture features are from surrounding areas of a main body area in the original picture or from candidate background pictures generated by the generator last time; and filling the main body area in the original picture through the generator according to the fusion characteristics and the content description text corresponding to the original picture, and generating a candidate background picture corresponding to the original picture.
In some embodiments, the picture feature of the original picture and the first picture feature are two matrices of the same size. Fusing the picture features of the original picture with the first picture features means that the element values of the corresponding positions of the picture features of the original picture and the first picture features are subjected to weighted average or weighted summation to obtain the fused features.
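A minimal sketch of this element-wise fusion, assuming both feature maps are tensors of identical shape and the weight is a preset hyperparameter:

```python
import torch

def fuse_features(original_feat: torch.Tensor, first_feat: torch.Tensor,
                  alpha: float = 0.5) -> torch.Tensor:
    """Weighted average of two same-sized feature maps, element by element."""
    assert original_feat.shape == first_feat.shape
    return alpha * original_feat + (1.0 - alpha) * first_feat
```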
In some embodiments, during the first round of training, since there is no candidate background picture at this time, the first picture features are from surrounding areas of the subject area in the original picture; during the training of the subsequent round, the first picture feature is from the candidate background picture last generated by the generator. The candidate background picture generated last time by the generator means the candidate background picture generated by the generator in the training process of the previous round.
By the method, the proper pixel value is generated in the main body area of the original picture, and the background picture similar to the original picture in style can be finally generated through continuous iterative training.
In the training process, the generator and the discriminator are mutually opposed, so that the quality of the generated pictures is improved by the generator.
The total loss of the generative adversarial network is determined by the pixel-level loss, the perceptual loss and the adversarial loss, and the different losses focus on different aspects: the pixel-level loss focuses on the difference between the candidate background picture and the original picture, the perceptual loss focuses on the difference between the feature representation of the candidate background picture and the feature representation of the original picture, and the adversarial loss focuses on the performance of the generator and of the discriminator. For example, the total loss of the generative adversarial network can be expressed as:

$L_{total} = \lambda_{pixel} L_{pixel} + \lambda_{perceptual} L_{perceptual} + \lambda_{adversarial} L_{adversarial}$

where $L_{pixel}$, $L_{perceptual}$ and $L_{adversarial}$ are respectively the pixel-level loss, the perceptual loss and the adversarial loss, and $\lambda_{pixel}$, $\lambda_{perceptual}$ and $\lambda_{adversarial}$ are the weights corresponding to the three loss terms, which may be empirical values set in advance.
In some embodiments, the pixel-level loss is calculated using the mean squared error (Mean Squared Error, abbreviated as MSE). For example, for the candidate background picture G and the original picture Y, the pixel-level loss can be expressed as:

$L_{pixel} = \frac{1}{N}\sum_{i=1}^{N}(G_i - Y_i)^2$

where N is the total number of pixels of the candidate background picture G (the candidate background picture G and the original picture Y are equal in size), and $G_i$ and $Y_i$ respectively denote the pixel values of the candidate background picture G and the original picture Y at position i, i being a positive integer with i ≤ N.
In some embodiments, the perceptual loss is calculated by comparing the feature representations of the candidate background picture and the original picture in the CNN.
In some embodiments, inputting the candidate background picture and the original picture into a pre-trained CNN to obtain a feature representation of the candidate background picture and a feature representation of the original picture; and calculating to obtain the perception loss based on the characteristic representation of the candidate background picture and the characteristic representation of the original picture.
For example, after inputting the candidate background picture and the original picture into the pre-trained CNN to obtain the feature representation of the candidate background picture and the feature representation of the original picture, calculating the mean square error between them, the perceptual loss may be expressed as:
$L_{perceptual} = \frac{1}{M}\sum_{j=1}^{M}\left(F(G_j) - F(Y_j)\right)^2$

where M is the total number of elements in the feature representation, and $F(G_j)$ and $F(Y_j)$ respectively denote the feature values at position j in the feature representations of the candidate background picture G and the original picture Y, j being a positive integer with j ≤ M.
In some embodiments, the adversarial loss is calculated using a cross-entropy loss.
In some embodiments, the loss of the generator is determined according to the discrimination result of the discriminator for the candidate background picture, and is used to measure how close the candidate background picture is to the original picture; the loss of the discriminator is determined according to the discrimination result of the discriminator for the candidate background picture and the discrimination result of the discriminator for the original picture, and is used to measure the discrimination accuracy of the discriminator; the adversarial loss is then determined based on the loss of the generator and the loss of the discriminator.
For example, the candidate background picture and the original picture are respectively input into the discriminator to obtain a probability matrix that the candidate background picture is a real (original) picture and a probability matrix that the original picture is a real picture, and a binary cross-entropy loss is used for the calculation. The loss of the discriminator can be expressed as:

$L_{Discriminator} = -E\left(\log D(Y_k) + \log(1 - D(G_k))\right)$

the loss of the generator can be expressed as:

$L_{Generator} = -E\left(\log D(G_k)\right)$

and the adversarial loss can be expressed as:

$L_{adversarial} = -E\left(\log D(Y_k) + \log(1 - D(G_k))\right)$

where $D(G_k)$ and $D(Y_k)$ respectively denote the outputs of the discriminator for the candidate background picture G and the original picture Y at position k, k is a positive integer with k ≤ N, and N is the total number of pixels of the candidate background picture (the candidate background picture G and the original picture Y are equal in size).
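Putting the formulas above together, a hedged sketch of the loss computation in PyTorch is given below; the feature representations are assumed to come from a pre-trained CNN as described earlier, and the loss weights are illustrative values.

```python
import torch
import torch.nn.functional as F_nn

def total_loss(G, Y, feat_G, feat_Y, d_fake, d_real,
               w_pixel=1.0, w_perceptual=0.1, w_adversarial=0.01):
    """L_total = w_pixel*L_pixel + w_perceptual*L_perceptual + w_adversarial*L_adversarial."""
    eps = 1e-8
    # Pixel-level loss: mean squared error between candidate background and original picture.
    l_pixel = F_nn.mse_loss(G, Y)
    # Perceptual loss: mean squared error between the two feature representations.
    l_perceptual = F_nn.mse_loss(feat_G, feat_Y)
    # Adversarial loss: binary cross entropy over the discriminator outputs D(G) and D(Y).
    l_adversarial = -(torch.log(d_real + eps) + torch.log(1.0 - d_fake + eps)).mean()
    return w_pixel * l_pixel + w_perceptual * l_perceptual + w_adversarial * l_adversarial
```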
By calculating the adversarial loss in this way, the performance of the generator and the discriminator can be optimized, so that the generator can generate filling content of higher quality.
In some embodiments, the training stopping condition may be that the set training time is reached, or that the total loss is minimized, which is not limited in this application.
In some embodiments, the generator generates a candidate background picture based on the content description text corresponding to the original picture, the discriminator discriminates between the candidate background picture and the original picture to obtain a discrimination result, and after the total loss is calculated, the parameters of the generator or of the discriminator are adjusted according to the total loss, with the two adjustments performed alternately: if the parameters of the generator are adjusted in the current round of training, the parameters of the discriminator are adjusted in the next round.
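An outline of this alternating update might look like the sketch below; the generator, discriminator, data loader and optimizer settings are assumptions rather than the concrete models used in this scheme.

```python
import torch

# generator, discriminator and data_loader are assumed to be defined elsewhere.
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
eps = 1e-8

for step, (original, mask, text_embedding) in enumerate(data_loader):
    candidate = generator(original, mask, text_embedding)    # fill the main body region
    if step % 2 == 0:
        # Discriminator turn: learn to tell the original picture from the candidate.
        d_real = discriminator(original)
        d_fake = discriminator(candidate.detach())           # block gradients into the generator
        loss_d = -(torch.log(d_real + eps) + torch.log(1 - d_fake + eps)).mean()
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    else:
        # Generator turn: make the candidate background picture fool the discriminator.
        d_fake = discriminator(candidate)
        loss_g = -torch.log(d_fake + eps).mean()
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```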
In this way, the parameters of the generator and the discriminator are adjusted in turn according to the total loss, and filling content of higher quality is generated through continuous training, so that a background picture whose filling content is highly similar to the original picture and blends in naturally can finally be obtained.
The following description is presented in terms of one field Jing Jinhang of use of the present application.
Referring to fig. 8, a schematic diagram of a picture layering interface 80 of a picture processing tool according to an embodiment of the present application is shown. The function selection control 81 displays "picture layering" at this time, that is, the user is using the picture processing tool to perform picture layering operation at this time, and the scheme provided in the embodiment of the present application implements the picture layering operation. After uploading the original picture to the picture processing tool, a preview 83 of the original picture is displayed in the interface, and a person picture 84 (corresponding to the main picture described above) and a background picture 85 are obtained by clicking on the one-touch hierarchical control 82.
The flow chart of the method for realizing the picture layering operation is shown in fig. 9. The image segmentation model performs region segmentation on the input original picture to obtain a segmentation map corresponding to the original picture, in which the person is the main body element and is marked dark red (#96053d in hexadecimal). The image segmentation model may be the control_v11p_sd15_seg model, a model in the ControlNet plug-in of Stable Diffusion that realizes semantic segmentation, i.e. region segmentation; the model includes a CNN and an FCN and can carry out region segmentation.
Each pixel in the segmentation map is traversed; if a pixel is dark red it is filled with white, and all other pixels are filled with black, resulting in a mask image. If there is no dark red pixel in the segmentation map, i.e. there is no main body element in the original picture, the user is prompted that the current picture contains no main body. Each pixel of the mask is then traversed, and the pixels of the original picture that fall outside the main body region marked by the mask are set to transparent, yielding the main body picture, namely the person picture 84.
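A sketch of this per-pixel traversal, assuming the segmentation map is saved as an RGB image and the person class is rendered exactly as #96053d, might look like this:

```python
import numpy as np
from PIL import Image

SUBJECT_COLOR = (0x96, 0x05, 0x3D)                 # dark red used for the person class

seg = np.array(Image.open("segmentation.png").convert("RGB"))
subject = np.all(seg == SUBJECT_COLOR, axis=-1)    # True where the pixel is dark red

if not subject.any():
    print("No main body found in the current picture")       # prompt the user
else:
    mask = np.where(subject, 255, 0).astype(np.uint8)        # white = main body, black = rest
    Image.fromarray(mask).save("mask.png")
    original = np.array(Image.open("original.png").convert("RGBA"))
    original[..., 3] = mask                                   # transparent outside the main body
    Image.fromarray(original).save("person_picture.png")
```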
The picture content of the original picture is extracted using the picture content recognition model to obtain the corresponding content description text. The picture content recognition model may be the ViT-B-32/openai model, a model in the clip-interrogator-ext plug-in of Stable Diffusion that recognizes picture content and outputs a content text.
Features of the original picture are extracted using a CNN, feature fusion is then performed to obtain fusion features, the GAN generates candidate background pictures according to the fusion features and the content description text, the filled background picture is obtained through continuous training, and post-processing is performed on the filled background picture to obtain the final background picture, namely the background picture 85. The CNN and GAN work described above can be accomplished using the control_v11p_sd15_inpainting model, a model with local redrawing capability in ControlNet.
The technical scheme of the application can also be applied to scenes such as advertisement design and game design, and specific implementation modes can refer to the above embodiments and are not repeated.
The following are device embodiments of the present application, which may be used to perform method embodiments of the present application. For details not disclosed in the device embodiments of the present application, please refer to the method embodiments of the present application.
Referring to fig. 10, a block diagram of a picture processing apparatus according to an embodiment of the present application is shown. The apparatus has the function of realizing the above method examples, and the function may be realized by hardware or by hardware executing corresponding software. The apparatus may be the computer device described above or may be provided in a computer device. As shown in fig. 10, the apparatus 1000 may include a region segmentation module 1010, a mask generation module 1020, a main body extraction module 1030, and a content filling module 1040.
The region segmentation module 1010 is configured to perform region segmentation on an original picture to obtain a segmentation map corresponding to the original picture, where the segmentation map includes a plurality of regions.
The mask generation module 1020 is configured to generate a mask corresponding to the original picture based on at least one main body region determined from the plurality of regions, where the mask is used to distinguish the main body region from other regions except the main body region in the original picture.
The main body extraction module 1030 is configured to extract the main body region in the original picture according to the original picture and the mask, so as to obtain a main body picture corresponding to the original picture.
The content filling module 1040 is configured to fill the main body area in the original picture according to the content description text corresponding to the original picture, so as to generate a background picture corresponding to the original picture.
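A minimal structural sketch of how the four modules above could be wired together; the class and method names are illustrative assumptions, not part of the apparatus as claimed.

```python
class PictureProcessingApparatus:
    def __init__(self, region_segmenter, mask_generator, subject_extractor, content_filler):
        self.region_segmenter = region_segmenter
        self.mask_generator = mask_generator
        self.subject_extractor = subject_extractor
        self.content_filler = content_filler

    def process(self, original_picture, content_description_text):
        # Region segmentation: original picture -> segmentation map with several regions.
        segmentation_map = self.region_segmenter(original_picture)
        # Mask generation: main body region(s) -> binary mask.
        mask = self.mask_generator(segmentation_map)
        # Subject extraction: original picture + mask -> main body picture.
        subject_picture = self.subject_extractor(original_picture, mask)
        # Content filling: fill the main body area guided by the description text.
        background_picture = self.content_filler(original_picture, mask, content_description_text)
        return subject_picture, background_picture
```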
In some embodiments, the content filling module 1040 is configured to fill the main body area in the original picture through a generative adversarial network according to the content description text corresponding to the original picture, so as to generate a background picture corresponding to the original picture.
In some embodiments, the generative adversarial network includes a generator and a discriminator. The content filling module 1040 includes: a generation sub-module, a discrimination sub-module, a loss sub-module, an adjustment sub-module, and a determination sub-module (not shown in fig. 10).
The generation sub-module is used for filling the main body area in the original picture through the generator according to the content description text corresponding to the original picture, to generate a candidate background picture corresponding to the original picture.
The discrimination sub-module is used for discriminating the candidate background picture and the original picture through the discriminator to obtain a discrimination result.
The loss sub-module is used for determining a total loss from the pixel-level loss, the perceptual loss and the generative adversarial loss; wherein the pixel-level loss is used to measure pixel differences between the candidate background picture and the original picture, the perceptual loss is used to measure differences between the feature representation of the candidate background picture and the feature representation of the original picture, and the generative adversarial loss is used to comprehensively measure the performance of the generator and the performance of the discriminator.
The adjustment sub-module is used for adjusting the parameters of the generator and the discriminator according to the total loss under the condition that the generative adversarial network does not meet the training stopping condition, and returning to the step of filling the main body area in the original picture through the generator according to the content description text corresponding to the original picture to generate a candidate background picture corresponding to the original picture.
The determination sub-module is used for determining the most recently generated candidate background picture as the background picture under the condition that the generative adversarial network meets the training stopping condition.
In some embodiments, the generation sub-module is configured to: extract picture features of the original picture; fuse the picture features of the original picture with a first picture feature to obtain a fusion feature, wherein the first picture feature comes from a surrounding area of the main body region in the original picture or from the candidate background picture most recently generated by the generator; and fill the main body area in the original picture through the generator according to the fusion feature and the content description text corresponding to the original picture, to generate a candidate background picture corresponding to the original picture.
In some embodiments, the loss sub-module is further configured to: determine the loss of the generator according to the discrimination result of the discriminator for the candidate background picture, wherein the loss of the generator is used for measuring the degree of closeness between the candidate background picture and the original picture; determine the loss of the discriminator according to the discrimination result of the discriminator for the candidate background picture and the discrimination result of the discriminator for the original picture, wherein the loss of the discriminator is used for measuring the discrimination accuracy of the discriminator; and determine the generative adversarial loss based on the loss of the generator and the loss of the discriminator.
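A hedged PyTorch sketch of the three losses combined into the total loss described above; the loss weights, the VGG layer used for the perceptual term and the BCE-style adversarial formulation are assumptions, not values fixed by this embodiment. Inputs are expected to be normalized image tensors of shape [N, 3, H, W], and d_fake/d_real are discriminator logits for the candidate and original pictures.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

_vgg = vgg16(weights=VGG16_Weights.DEFAULT).features[:16].eval()   # frozen feature extractor
for p in _vgg.parameters():
    p.requires_grad_(False)

def total_loss(candidate, original, d_fake, d_real, w_pix=1.0, w_perc=0.1, w_adv=0.01):
    # Pixel-level loss: per-pixel difference between candidate and original.
    pixel = F.l1_loss(candidate, original)
    # Perceptual loss: difference between feature representations.
    perceptual = F.mse_loss(_vgg(candidate), _vgg(original))
    # Generator loss: how close the candidate looks to "real" for the discriminator.
    g_loss = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    # Discriminator loss: discrimination accuracy on original vs. generated pictures.
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
              F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    adversarial = g_loss + d_loss   # jointly measures generator and discriminator
    return w_pix * pixel + w_perc * perceptual + w_adv * adversarial
```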
In some embodiments, the region segmentation module 1010 is configured to: extract picture features of the original picture; determine the category corresponding to each pixel in the original picture according to the picture features of the original picture; and obtain the segmentation map corresponding to the original picture according to the category corresponding to each pixel in the original picture, wherein pixels belonging to the same category correspond to one region in the segmentation map and pixels in different regions correspond to different categories.
In some embodiments, the mask generation module 1020 is configured to: set the pixel values of the main body area to a first value and the pixel values of the other areas except the main body area to a second value, so as to generate the mask corresponding to the original picture, wherein the first value and the second value are different.
In some embodiments, the apparatus 1000 further includes a text generation module (not shown in fig. 10) configured to identify the content of the original picture by using a picture content identification model, so as to obtain a content description text corresponding to the original picture.
In some embodiments, the apparatus 1000 further includes a post-processing module (not shown in fig. 10) configured to post-process the background picture to obtain the processed background picture; wherein the post-processing includes at least one of: color correction, flaw removal and edge repair.
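A rough post-processing sketch with simple OpenCV stand-ins for color correction, flaw removal and edge repair; the concrete operators chosen here are assumptions, not the techniques fixed by this embodiment.

```python
import cv2
import numpy as np

background = cv2.imread("background.png")                    # placeholder file name

# Color correction: histogram equalization on the luminance channel.
ycrcb = cv2.cvtColor(background, cv2.COLOR_BGR2YCrCb)
ycrcb[..., 0] = cv2.equalizeHist(ycrcb[..., 0])
corrected = cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)

# Flaw removal: light denoising to suppress small generation artifacts.
denoised = cv2.fastNlMeansDenoisingColored(corrected, None, 3, 3, 7, 21)

# Edge repair: blend a slightly blurred copy along strong edges to soften seams.
edges = cv2.Canny(cv2.cvtColor(denoised, cv2.COLOR_BGR2GRAY), 100, 200)
blurred = cv2.GaussianBlur(denoised, (5, 5), 0)
edge_mask = cv2.dilate(edges, np.ones((3, 3), np.uint8))[..., None].astype(bool)
repaired = np.where(edge_mask, blurred, denoised)

cv2.imwrite("background_post.png", repaired)
```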
In summary, according to the technical scheme provided by the embodiments of the present application, region segmentation is performed on the original picture to obtain a segmentation map comprising a plurality of regions; the main body region is then determined from the segmentation map and the corresponding mask is generated; the main body picture is extracted from the original picture according to the mask; and the main body area in the original picture is filled according to the content description text corresponding to the original picture, so that the background picture corresponding to the original picture is obtained. This automates the separation of the main body picture and the background picture, reduces the steps and difficulty of user operation, and significantly improves the separation efficiency. Moreover, because the main body area in the original picture is filled based on the content description text corresponding to the original picture, the generated filling content is more accurate and the quality of the background picture is improved.
It should be noted that, in the apparatus provided in the foregoing embodiment, when implementing the functions thereof, only the division of the foregoing functional modules is used as an example, in practical application, the foregoing functional allocation may be implemented by different functional modules, that is, the internal structure of the device is divided into different functional modules, so as to implement all or part of the functions described above. In addition, the apparatus and the method embodiments provided in the foregoing embodiments belong to the same concept, and specific implementation processes of the apparatus and the method embodiments are detailed in the method embodiments and are not repeated herein.
Referring to FIG. 11, a block diagram of a computer device 1100 according to one embodiment of the present application is shown. The computer device 1100 may be the computer device 10 in the implementation environment shown in fig. 1, and is used for implementing the picture processing method provided in the above-described embodiments.
In general, the computer device 1100 includes: a processor 1110 and a memory 1120.
Processor 1110 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 1110 may be implemented in at least one hardware form of a digital signal processor (Digital Signal Processor, DSP), a field programmable gate array (Field Programmable Gate Array, FPGA), or a programmable logic array (Programmable Logic Array, PLA). The processor 1110 may also include a main processor and a coprocessor: the main processor is a processor for processing data in an awake state, also called a central processing unit (Central Processing Unit, CPU); the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 1110 may be integrated with a graphics processing unit (Graphics Processing Unit, GPU) for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 1110 may also include an AI processor for processing computing operations related to machine learning.
Memory 1120 may include one or more computer-readable storage media, which may be non-transitory. Memory 1120 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1120 is used to store a computer program configured to be executed by one or more processors to implement the above-described picture processing methods.
Those skilled in the art will appreciate that the architecture shown in fig. 11 is not limiting as to the computer device 1100, and may include more or fewer components than shown, or may combine certain components, or employ a different arrangement of components.
In an exemplary embodiment, a computer-readable storage medium is also provided, in which a computer program is stored which, when executed by a processor, implements the above-mentioned picture processing method. Alternatively, the computer-readable storage medium may include: Read-Only Memory (ROM), Random Access Memory (RAM), Solid State Drive (SSD), optical disk, or the like. The random access memory may include Resistive Random Access Memory (ReRAM) and Dynamic Random Access Memory (DRAM).
In an exemplary embodiment, a computer program product is also provided, the computer program product comprising a computer program stored in a computer readable storage medium. A processor of a computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program so that the computer device executes the above-described picture processing method.
It should be noted that when the embodiments of the present application are applied to specific products or technologies, the relevant data (such as pictures) should be collected and processed in strict accordance with the requirements of relevant national laws and regulations, the informed consent or separate consent of the personal information subject should be obtained, and subsequent data use and processing should be carried out within the scope authorized by laws and regulations and by the personal information subject.
It should be understood that references herein to "a plurality" mean two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. In addition, the step numbers described herein merely illustrate one possible execution order of the steps; in some other embodiments, the steps may be executed out of the numbered order, for example two differently numbered steps may be executed simultaneously, or in an order opposite to that shown, which is not limited by the embodiments of the present application.
The foregoing is only an exemplary description of the preferred embodiments of the present application and is not intended to limit the present application; any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present application shall be included in the protection scope of the present application.

Claims (13)

1. A picture processing method, the method comprising:
performing region segmentation on an original picture to obtain a segmentation map corresponding to the original picture, wherein the segmentation map comprises a plurality of regions;
generating a mask corresponding to the original picture based on at least one main body region determined from the plurality of regions, wherein the mask is used for distinguishing the main body region from other regions except the main body region in the original picture;
extracting the main body region in the original picture according to the original picture and the mask to obtain a main body picture corresponding to the original picture;
and filling the main body area in the original picture according to the content description text corresponding to the original picture, and generating a background picture corresponding to the original picture.
2. The method according to claim 1, wherein the filling the main body area in the original picture according to the content description text corresponding to the original picture, and generating the background picture corresponding to the original picture comprises:
filling the main body area in the original picture through a generative adversarial network according to the content description text corresponding to the original picture, and generating the background picture corresponding to the original picture.
3. The method according to claim 2, wherein the generative adversarial network comprises a generator and a discriminator;
the filling the main body area in the original picture through the generative adversarial network according to the content description text corresponding to the original picture, and generating the background picture corresponding to the original picture comprises:
filling the main body region in the original picture through the generator according to the content description text corresponding to the original picture, and generating a candidate background picture corresponding to the original picture;
discriminating the candidate background picture and the original picture by the discriminator to obtain a discrimination result;
determining a total loss from a pixel-level loss, a perceptual loss, and a generative adversarial loss; wherein the pixel-level loss is used to measure pixel differences between the candidate background picture and the original picture, the perceptual loss is used to measure differences between the feature representation of the candidate background picture and the feature representation of the original picture, and the generative adversarial loss is used to comprehensively measure the performance of the generator and the performance of the discriminator;
under the condition that the generative adversarial network does not meet the training stopping condition, adjusting parameters of the generator and the discriminator according to the total loss, and returning to the step of filling the main body area in the original picture through the generator according to the content description text corresponding to the original picture to generate a candidate background picture corresponding to the original picture;
and under the condition that the generative adversarial network meets the training stopping condition, determining the most recently generated candidate background picture as the background picture.
4. The method according to claim 3, wherein the filling, by the generator, the main body area in the original picture according to the content description text corresponding to the original picture to generate the candidate background picture corresponding to the original picture comprises:
extracting picture characteristics of the original picture;
fusing the picture features of the original picture with a first picture feature to obtain a fusion feature; wherein the first picture feature is from a surrounding area of the main body area in the original picture or from the candidate background picture most recently generated by the generator;
And filling the main body area in the original picture through the generator according to the fusion characteristics and the content description text corresponding to the original picture, and generating a candidate background picture corresponding to the original picture.
5. The method according to claim 3, wherein the method further comprises:
determining the loss of the generator according to the discrimination result of the discriminator for the candidate background picture, wherein the loss of the generator is used for measuring the degree of closeness between the candidate background picture and the original picture;
determining the loss of the discriminator according to the discrimination result of the discriminator for the candidate background picture and the discrimination result of the discriminator for the original picture, wherein the loss of the discriminator is used for measuring the discrimination precision of the discriminator;
determining the generative adversarial loss based on the loss of the generator and the loss of the discriminator.
6. The method according to claim 1, wherein the performing region segmentation on the original picture to obtain the segmentation map corresponding to the original picture includes:
extracting picture characteristics of the original picture;
Determining a category corresponding to each pixel in the original picture according to the picture characteristics of the original picture;
obtaining a segmentation map corresponding to the original picture according to the category corresponding to each pixel in the original picture; wherein each pixel belonging to the same class corresponds to one region in the segmentation map and pixels in different regions correspond to different classes.
7. The method of claim 1, wherein generating the mask corresponding to the original picture based on the at least one subject region determined from the plurality of regions comprises:
setting the pixel values of the main body area as a first value, and setting the pixel values of other areas except the main body area as a second value, so as to generate a mask corresponding to the original picture; wherein the first value and the second value are different.
8. The method according to claim 1, wherein the method further comprises:
and identifying the content of the original picture by adopting a picture content identification model to obtain a content description text corresponding to the original picture.
9. The method according to claim 1, wherein after the filling the main body area in the original picture according to the content description text corresponding to the original picture and generating the background picture corresponding to the original picture, the method further comprises:
Post-processing the background picture to obtain the processed background picture; wherein the post-processing includes at least one of: color correction, flaw removal and edge repair.
10. A picture processing apparatus, the apparatus comprising:
the image processing device comprises a region segmentation module, a region segmentation module and a display module, wherein the region segmentation module is used for carrying out region segmentation on an original image to obtain a segmentation image corresponding to the original image, and the segmentation image comprises a plurality of regions;
a mask generation module, configured to generate a mask corresponding to the original picture based on at least one main area determined from the plurality of areas, where the mask is used to distinguish the main area from other areas except the main area in the original picture;
the main body extraction module is used for extracting the main body area in the original picture according to the original picture and the mask to obtain a main body picture corresponding to the original picture;
and the content filling module is used for filling the main body area in the original picture according to the content description text corresponding to the original picture to generate a background picture corresponding to the original picture.
11. A computer device, characterized in that it comprises a processor and a memory, in which a computer program is stored, which processor executes the computer program to implement the method according to any of claims 1 to 9.
12. A computer readable storage medium, characterized in that the storage medium has stored therein a computer program for execution by a processor for implementing the method according to any one of claims 1 to 9.
13. A computer program product, characterized in that the computer program product comprises a computer program that is loaded and executed by a processor to implement the method of any one of claims 1 to 9.
CN202311283862.5A 2023-09-28 2023-09-28 Picture processing method, device, equipment and storage medium Pending CN117541613A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311283862.5A CN117541613A (en) 2023-09-28 2023-09-28 Picture processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311283862.5A CN117541613A (en) 2023-09-28 2023-09-28 Picture processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117541613A true CN117541613A (en) 2024-02-09

Family

ID=89786820

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311283862.5A Pending CN117541613A (en) 2023-09-28 2023-09-28 Picture processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117541613A (en)

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication