CN113538227A - Image processing method based on semantic segmentation and related equipment


Info

Publication number
CN113538227A
Authority
CN
China
Prior art keywords
semantic
images
image
resolution
semantic segmentation
Prior art date
Legal status
Granted
Application number
CN202010313277.5A
Other languages
Chinese (zh)
Other versions
CN113538227B (en)
Inventor
张欢
陈刚
马飞龙
田晶铎
李莹莹
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN202010313277.5A
Priority claimed from CN202010313277.5A
Publication of CN113538227A
Application granted
Publication of CN113538227B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 - Geometric image transformation in the plane of the image
    • G06T 3/40 - Scaling the whole image or part thereof
    • G06T 3/4053 - Super resolution, i.e. output image resolution higher than sensor resolution
    • G06T 7/00 - Image analysis
    • G06T 7/10 - Segmentation; Edge detection
    • G06T 7/11 - Region-based segmentation
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 - Image acquisition modality
    • G06T 2207/10004 - Still image; Photographic image
    • G06T 2207/10024 - Color image

Abstract

The embodiments of the present application disclose an image processing method based on semantic segmentation and a related device, which can be applied in particular to fields such as image processing and intelligent photography. The image processing method based on semantic segmentation may include the following steps: acquiring a target image; inputting the target image into a semantic segmentation network to obtain a target semantic segmentation map of the target image, wherein the target semantic segmentation map comprises K first semantic regions and P second semantic regions; and inputting the target semantic segmentation map and the target image into a super-resolution network, performing first super-resolution processing on the target image according to the K first semantic regions, and performing second super-resolution processing on the target image according to the P second semantic regions to obtain a super-resolution image corresponding to the target image. In this way, the obtained super-resolution image looks real and natural and reproduces the rich textures of the actual scene.

Description

Image processing method based on semantic segmentation and related equipment
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an image processing method based on semantic segmentation and a related device.
Background
Super-Resolution (SR) is a technique for reconstructing a high-resolution image from one or more observed low-resolution images, and has important application value in fields such as improving photographic detail, surveillance equipment, satellite imagery, and medical imaging. Super-resolution techniques fall into two categories: reconstructing a high-resolution image from multiple low-resolution images, and reconstructing a high-resolution image from a single low-resolution image. Deep-learning-based super-resolution is mainly Single Image Super-Resolution (SISR), i.e., reconstruction from a single low-resolution image.
Although deep-learning-based super-resolution can obtain a corresponding high-resolution image from a single low-resolution image and thereby, for example, noticeably improve the photographing quality of mobile phones, current SISR cannot adaptively enhance regions of the input low-resolution image that have different texture strengths (i.e., apply different degrees of detail enhancement to different regions). Because conventional super-resolution places excessive weight on detail enhancement during reconstruction, the resulting high-resolution image readily contains false image information and false textures. For example, a region of the actual scene that contains no content may show spurious content in the processed high-resolution image, and a region with no texture may acquire false texture.
Therefore, how to make the high-resolution image obtained after super-resolution processing look authentic while avoiding redundant pseudo-textures is an urgent problem to be solved.
Disclosure of Invention
The embodiments of the present application provide an image processing method based on semantic segmentation and a related device, so as to improve the authenticity of the super-resolution image obtained after super-resolution processing of a low-resolution image and to avoid generating redundant pseudo-textures in the super-resolution image.
In a first aspect, an embodiment of the present application provides an image processing method based on semantic segmentation, which may include: acquiring a target image; inputting the target image into a semantic segmentation network to obtain a target semantic segmentation map of the target image, wherein the target semantic segmentation map comprises K first semantic regions and P second semantic regions, each of the K first semantic regions is a region obtained by segmentation according to a preset semantic category, each of the P second semantic regions is a region of the target image whose image frequency is smaller than a first preset value and which is used for texture addition in the super-resolution processing, and K and P are integers greater than or equal to 1; and inputting the target semantic segmentation map and the target image into a super-resolution network, performing first super-resolution processing on the target image according to the K first semantic regions, and performing second super-resolution processing on the target image according to the P second semantic regions to obtain a super-resolution image corresponding to the target image, wherein the resolution of the super-resolution image is greater than or equal to the resolution of the target image.
With the method provided by the first aspect, a captured target image (a low-resolution image) can first be input into a semantic segmentation network obtained through pre-training to obtain a target semantic segmentation map of the target image. The target semantic segmentation map may include a plurality of first semantic regions (for example, regions segmented according to common semantic categories such as sky, buildings, people, plants, animals, water surfaces, roads, bridges, vehicles, and traffic signals) and a plurality of second semantic regions (for example, regions of the target image whose image frequency is lower than a first preset value and that are used for texture addition in super-resolution processing, that is, weak-texture regions whose textures are difficult to recover). The obtained target semantic segmentation map and the target image are then passed to a super-resolution network obtained by pre-training, which applies different super-resolution processing to the two kinds of semantic regions. For example, for first semantic regions whose image frequency is greater than or equal to the first preset value, texture can be added under the guidance of the corresponding semantic category, enhancing the texture and enriching image detail (for example, textures belonging to plants are added to a first semantic region whose semantics are plants). For first semantic regions whose image frequency is smaller than the first preset value, no texture is added, which prevents the super-resolution image from acquiring redundant textures that do not exist in the actual scene (i.e., avoids generating pseudo-textures). For the plurality of second semantic regions, the pre-trained super-resolution network can match corresponding textures and add them to the corresponding regions, thereby recovering the real textures of such weak-texture regions that are otherwise difficult to restore. In the prior art, a low-resolution image is semantically segmented only according to common semantic categories and then super-resolved under the guidance of those semantics, which easily produces false, unnatural, redundant textures. In contrast, this application segments the low-resolution image more finely, according to both the common semantic categories and the weak-texture regions that are difficult to recover, obtaining a plurality of first semantic regions and a plurality of second semantic regions, and applies different super-resolution processing to the two kinds of regions through the pre-trained super-resolution network. This recovers real, natural textures, avoids generating redundant pseudo-textures, guarantees the authenticity of the super-resolution image, and improves the user's photographing experience.
It should be noted that the details and textures of the super-resolution image are richer than those of the target image, but the resolution of the super-resolution image may be greater than or equal to that of the target image.
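To make the flow of the first aspect easier to follow, the following sketch (given for illustration only and not forming part of the claimed embodiments) shows the two-stage inference described above in PyTorch style; seg_net and sr_net are hypothetical placeholders for the pre-trained semantic segmentation network and super-resolution network, and the tensor layouts noted in the comments are assumptions rather than structures fixed by this application.

```python
import torch

def super_resolve(target_image: torch.Tensor,
                  seg_net: torch.nn.Module,
                  sr_net: torch.nn.Module) -> torch.Tensor:
    """target_image: low-resolution RGB tensor of shape (1, 3, H, W)."""
    with torch.no_grad():
        # Step 1: obtain the target semantic segmentation map, containing
        # K first semantic regions (preset categories such as sky or plants)
        # and P second semantic regions (weak-texture regions marked for
        # texture addition).
        seg_map = seg_net(target_image)            # e.g. (1, K + P, H, W) probability map
        # Step 2: feed the segmentation map together with the target image
        # into the super-resolution network, which applies the first
        # super-resolution processing to the first semantic regions and the
        # second super-resolution processing to the second semantic regions.
        sr_image = sr_net(target_image, seg_map)   # (1, 3, s*H, s*W), s >= 1
    return sr_image
```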
In one possible implementation, the preset semantic category includes one or more of sky, buildings, people, plants, animals, water surfaces, roads, bridges, vehicles, and traffic signals; the performing of the first super-resolution processing on the target image according to the K first semantic regions and of the second super-resolution processing on the target image according to the P second semantic regions includes: determining M first semantic regions among the K first semantic regions, and adding textures to the corresponding M regions in the target image according to the preset semantic categories respectively corresponding to the M first semantic regions, wherein each of the M first semantic regions is a region of the target image whose image frequency is greater than or equal to the first preset value; and adding textures to the corresponding P regions in the target image according to the P second semantic regions, respectively.
In the embodiment of the present application, each first semantic region may be a region segmented according to a common semantic category such as sky, buildings, people, plants, animals, water surfaces, roads, bridges, vehicles, or traffic signals. The image frequency of each first semantic region may differ. For a first semantic region whose image frequency is greater than or equal to the first preset value, texture can be added to the corresponding region of the target image according to its semantic category (for example, for a first semantic region with plant semantics whose image frequency exceeds the first preset value, plant texture may be added to the corresponding region of the target image under the guidance of the plant semantic information), so that the added texture is more natural and realistic and conforms to the actual scene, greatly improving the quality of the super-resolution image. On the other hand, for a first semantic region whose image frequency is smaller than the first preset value (e.g., a blue, cloudless sky region that may contain only low-frequency information such as a plain blue color block), no texture is added, which prevents the super-resolution image from acquiring redundant textures that do not exist in the actual scene (i.e., avoids generating pseudo-textures). A second semantic region is a region (such as a roof) that has abundant texture in the actual scene but almost no texture in the target image (the low-resolution image); for the second semantic regions, the pre-trained super-resolution network matches the corresponding textures and adds them to the respective regions, recovering the real textures of the actual scene. In this way, under the guidance and constraint of finer-grained semantic information, the method decides for each region of the target image whether texture addition is needed and, if so, performs the corresponding texture addition. This guarantees the authenticity of the super-resolution image obtained after super-resolution processing and improves the user's photographing experience.
In one possible implementation, the method further includes: determining Q first semantic regions among the K first semantic regions, and, according to the Q first semantic regions, adding no texture to the corresponding Q regions in the target image; each of the Q first semantic regions is a region of the target image whose image frequency is smaller than the first preset value; M and Q are integers greater than or equal to 0, and the sum of M and Q is K.
In the embodiment of the present application, for a first semantic region whose image frequency is smaller than the first preset value (for example, a blue, cloudless sky region that may contain only low-frequency information such as a plain blue color block), no texture is added. This prevents the super-resolution image from acquiring redundant textures that do not conform to the actual scene (i.e., avoids generating pseudo-textures), guarantees the authenticity of the super-resolution image obtained after super-resolution processing, and improves the user's photographing experience.
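The embodiments above do not fix how the image frequency of a region is measured or what the first preset value is; the following sketch (for illustration only) makes the decision over the M, Q, and P regions explicit, using a mean-gradient proxy for the image frequency and an arbitrary threshold value, both of which are assumptions.

```python
import numpy as np

def region_frequency(gray_image: np.ndarray, mask: np.ndarray) -> float:
    """Crude image-frequency proxy: mean gradient magnitude inside the region mask.
    gray_image: 2-D grayscale array; mask: boolean array of the same shape."""
    gy, gx = np.gradient(gray_image.astype(np.float32))
    grad = np.hypot(gx, gy)
    return float(grad[mask].mean()) if mask.any() else 0.0

def plan_texture_addition(gray_image, first_regions, second_regions, first_preset=8.0):
    """first_regions: list of (semantic_category, boolean mask); second_regions: list of masks.
    The threshold first_preset is an illustrative placeholder for the first preset value."""
    plan = []
    for category, mask in first_regions:        # the K first semantic regions
        if region_frequency(gray_image, mask) >= first_preset:
            # M regions: semantics-guided texture addition
            plan.append((mask, f"add texture guided by '{category}' semantics"))
        else:
            # Q regions: left untouched to avoid pseudo-textures
            plan.append((mask, "no texture addition"))
    for mask in second_regions:                 # the P second semantic regions
        plan.append((mask, "add matched texture (weak-texture region)"))
    return plan
```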
In one possible implementation, the method further includes: acquiring a first image set and a second image set, wherein the first image set comprises N first images, the second image set comprises N second images, the N first images correspond one-to-one to the N second images, and the resolution of each of the N second images is greater than that of the corresponding first image, N being an integer greater than or equal to 1; performing first semantic segmentation on the N first images according to the preset semantic categories to obtain N first semantic segmentation maps corresponding to the N first images, wherein each of the N first semantic segmentation maps comprises one or more first semantic regions; performing frequency analysis on the N first images and the N second images to obtain N first frequency maps corresponding to the N first images and N second frequency maps corresponding to the N second images, respectively; and performing second semantic segmentation on the N first images according to a preset condition based on the N first frequency maps and the N second frequency maps to obtain N second semantic segmentation maps corresponding to the N first images, wherein each of the N second semantic segmentation maps comprises one or more first semantic regions and one or more second semantic regions.
In this embodiment of the application, a first image set (for example, a plurality of first images of lower resolution) and a second image set (for example, a plurality of second images of higher resolution corresponding one-to-one to the first images, where each second image is the high-resolution version of its first image and the image content of the two is the same) are first acquired. First semantic segmentation is then performed on each first image according to the preset semantic categories (for example, a developer may use an existing semantic annotation tool to select and label the semantic regions, thereby implementing the first semantic segmentation of each first image), yielding a first semantic segmentation map for each first image. Then, based on the first frequency map and second frequency map obtained by frequency analysis of each first image and second image, second semantic segmentation is performed on each first image (for example, a developer may use an existing semantic annotation tool to frame and label the regions of the first image where the frequency difference between the first frequency map and the second frequency map is larger, thereby implementing the second semantic segmentation of each first image), yielding a second semantic segmentation map for each first image. In this way, the N second semantic segmentation maps can be obtained quickly and accurately, providing a large amount of effective training data for the semantic segmentation network and the super-resolution network and making the trained networks more efficient and accurate.
In a possible implementation manner, the performing, based on the N first frequency maps and the N second frequency maps, second semantic segmentation on the N first images according to a preset condition to obtain N second semantic segmentation maps corresponding to the N first images includes: comparing the N first frequency maps with the N second frequency maps in one-to-one correspondence, and determining, as one or more second semantic regions, one or more regions for which the difference between the image frequency in the i-th first frequency map of the N first frequency maps and the image frequency in the corresponding i-th second frequency map of the N second frequency maps is greater than a second preset value, wherein each of the one or more regions is a region whose image frequency in the i-th first frequency map is smaller than the first preset value; and performing second semantic segmentation on the i-th first image corresponding to the i-th first frequency map according to the one or more second semantic regions to obtain a second semantic segmentation map corresponding to the i-th first image, wherein the second semantic segmentation map corresponding to the i-th first image comprises the one or more first semantic regions and the one or more second semantic regions, and i is an integer greater than or equal to 1 and less than or equal to N.
In the embodiment of the present application, second semantic segmentation is performed on each first image (for example, a developer may use an existing semantic annotation tool to select and annotate the regions of the first image where the frequency difference between the first frequency map and the second frequency map is larger, thereby implementing the second semantic segmentation of each first image), yielding a second semantic segmentation map for each first image. In this way, the N second semantic segmentation maps can be obtained quickly and accurately, providing a large amount of effective training data for the semantic segmentation network and the super-resolution network and improving their training efficiency.
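As an illustration of the comparison just described, the following sketch marks as candidate second semantic regions the pixels whose frequency in a first (low-resolution) frequency map is below the first preset value while the gap to the corresponding second (high-resolution) frequency map exceeds the second preset value; the per-pixel frequency-map representation and the threshold values are assumptions made only for this example.

```python
import numpy as np

def second_semantic_mask(first_freq_map: np.ndarray,
                         second_freq_map: np.ndarray,
                         first_preset: float,
                         second_preset: float) -> np.ndarray:
    """Both frequency maps are per-pixel frequency estimates of the same scene
    (low-resolution and high-resolution versions, resampled to the same size)."""
    weak_in_low_res = first_freq_map < first_preset                 # almost no texture in the first image
    large_gap = (second_freq_map - first_freq_map) > second_preset  # rich texture only in the second image
    return weak_in_low_res & large_gap   # boolean mask of candidate second semantic regions
```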
In one possible implementation, the method further includes: acquiring a first training sample set, wherein the first training sample set comprises the N first images and the N second semantic segmentation maps; and training to obtain the semantic segmentation network by taking the N first images and the N second semantic segmentation maps as training input and taking the N second semantic segmentation maps as N labels.
In the embodiment of the present application, the N first images and the N second semantic segmentation maps are used as training data to efficiently and accurately train a complete semantic segmentation network. This network can perform fine-grained semantic segmentation on a low-resolution image, producing a semantic segmentation map that includes first semantic regions and second semantic regions, thereby providing fine semantic guidance for the subsequent super-resolution processing, so that the texture of the low-resolution image is enhanced correspondingly under the guidance of each semantic region, interference between semantic regions is reduced, the authenticity of the super-resolution image is guaranteed, and the user's photographing experience is improved.
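A minimal training loop consistent with this implementation might look as follows; the backbone network, the cross-entropy loss, the optimizer, and the hyper-parameters are illustrative assumptions, since the embodiments do not fix them. In this sketch the N first images serve as the input and the N second semantic segmentation maps serve purely as the labels.

```python
import torch
import torch.nn as nn

def train_segmentation_network(seg_net: nn.Module, loader, epochs: int = 10, lr: float = 1e-4):
    """loader yields (first_image, second_seg_map):
    first_image:    (B, 3, H, W) low-resolution input
    second_seg_map: (B, H, W) integer label map covering first and second semantic regions."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(seg_net.parameters(), lr=lr)
    seg_net.train()
    for _ in range(epochs):
        for first_image, second_seg_map in loader:
            logits = seg_net(first_image)            # (B, num_regions, H, W)
            loss = criterion(logits, second_seg_map)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return seg_net
```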
In one possible implementation, the method further includes: acquiring a second training sample set, wherein the second training sample set comprises the N first images, the N second semantic segmentation maps, and the N second images; and training the super-resolution network with the N first images, the N second semantic segmentation maps, and the N second images as training input and the N second images as N labels.
In the embodiment of the present application, the N first images, the N second semantic segmentation maps, and the N second images are used as training data to efficiently and accurately train a complete super-resolution network. Based on the semantic segmentation map of a low-resolution image, this network performs different super-resolution processing on different semantic regions under the guidance of their respective semantics, so that the image resolution is improved and real, natural textures are recovered while redundant pseudo-textures are avoided, guaranteeing the authenticity of the super-resolution image and improving the user's photographing experience.
In a possible implementation manner, the training with the N first images, the N second semantic segmentation maps, and the N second images as training input and with the N second images as N labels to obtain the super-resolution network includes: inputting the N first images and the N second semantic segmentation maps into an initial neural network, extracting features of the N first images based on the N second semantic segmentation maps, and upsampling the N first images after feature extraction to obtain N third images corresponding to the N first images; and, with the N second images as N labels, performing loss calculation on the N third images based on the N second semantic segmentation maps and correcting one or more parameters of the initial neural network to obtain the super-resolution network.
In the embodiment of the present application, the N first images and the N second semantic segmentation maps may be input into an initial neural network; features of the N first images are then extracted based on the N second semantic segmentation maps, and the N first images after feature extraction are upsampled to obtain N third images corresponding to the N first images. Finally, the loss calculation is guided by the N second semantic segmentation maps, and one or more parameters of the initial neural network are corrected iteratively. A complete super-resolution network can thus be trained quickly and effectively and subsequently used for image processing based on semantic segmentation of low-resolution images, yielding super-resolution images with rich detail and real, natural textures and improving the user's photographing experience.
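The embodiments state that the loss calculation on the N third images is based on the N second semantic segmentation maps but do not fix the form of the loss; the following single training step is therefore only a sketch in which a region-weighted L1 loss stands in for the semantics-guided loss, and the sr_net interface, tensor layouts, and region_weights parameter are assumptions.

```python
import torch
import torch.nn.functional as F

def sr_training_step(sr_net, optimizer, first_image, second_seg_map, second_image,
                     region_weights: torch.Tensor):
    """second_seg_map: (B, C, H, W) one-hot region map; region_weights: (C,) per-region loss weights."""
    third_image = sr_net(first_image, second_seg_map)              # upsampled prediction (third image)
    # Derive a per-pixel weight from the segmentation map so that the loss
    # is guided by the semantic regions (region-weighted L1 as an example).
    weight = (second_seg_map * region_weights.view(1, -1, 1, 1)).sum(dim=1, keepdim=True)
    weight = F.interpolate(weight, size=third_image.shape[-2:], mode="nearest")
    loss = (weight * (third_image - second_image).abs()).mean()    # the second image acts as the label
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```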
In a possible implementation manner, the target image, the N first images, and the N second images are images obtained by shooting for a scene in a target geographic area; the semantic segmentation network is a semantic segmentation network aiming at the target geographic area, and the super-resolution network is a super-resolution network aiming at the target geographic area.
In the embodiment of the present application, a semantic segmentation network and a super-resolution network for a particular geographic area can be obtained by training on low-resolution and high-resolution images captured in that geographic area. When a user photographs a scene in that geographic area with a terminal device such as a smartphone, the original image may be a low-resolution image; it can then be semantically segmented and super-resolved by the pre-trained semantic segmentation network and super-resolution network, yielding a super-resolution image with high resolution, rich detail, and real, natural textures, improving the user's photographing experience.
In a second aspect, an embodiment of the present application provides an image processing apparatus based on semantic segmentation, which may include: a first acquisition unit, configured to acquire a target image; a first semantic segmentation unit, configured to input the target image into a semantic segmentation network to obtain a target semantic segmentation map of the target image, wherein the target semantic segmentation map comprises K first semantic regions and P second semantic regions, each of the K first semantic regions is a region obtained by segmentation according to a preset semantic category, each of the P second semantic regions is a region of the target image whose image frequency is smaller than a first preset value and which is used for texture addition in the super-resolution processing, and K and P are integers greater than or equal to 1; and a super-resolution unit, configured to input the target semantic segmentation map and the target image into a super-resolution network, perform first super-resolution processing on the target image according to the K first semantic regions, and perform second super-resolution processing on the target image according to the P second semantic regions to obtain a super-resolution image corresponding to the target image, wherein the resolution of the super-resolution image is greater than or equal to the resolution of the target image.
In one possible implementation, the preset semantic category includes one or more of sky, buildings, people, plants, animals, water surfaces, roads, bridges, vehicles, and traffic signals; the super-resolution unit is specifically configured to: determine M first semantic regions among the K first semantic regions, and add textures to the corresponding M regions in the target image according to the preset semantic categories respectively corresponding to the M first semantic regions, wherein each of the M first semantic regions is a region of the target image whose image frequency is greater than or equal to the first preset value; and add textures to the corresponding P regions in the target image according to the P second semantic regions, respectively.
In a possible implementation manner, the super-resolution unit is further specifically configured to: determine Q first semantic regions among the K first semantic regions and, according to the Q first semantic regions, add no texture to the corresponding Q regions in the target image; each of the Q first semantic regions is a region of the target image whose image frequency is smaller than the first preset value; M and Q are integers greater than or equal to 0, and the sum of M and Q is K.
In one possible implementation, the apparatus further includes: a second obtaining unit, configured to obtain a first image set and a second image set, where the first image set includes N first images, the second image set includes N second images, the N first images correspond one-to-one to the N second images, and the resolution of each of the N second images is greater than that of the corresponding first image, N being an integer greater than or equal to 1; a second semantic segmentation unit, configured to perform first semantic segmentation on the N first images according to the preset semantic categories to obtain N first semantic segmentation maps corresponding to the N first images, wherein each of the N first semantic segmentation maps comprises one or more first semantic regions; a frequency analyzing unit, configured to perform frequency analysis on the N first images and the N second images to obtain N first frequency maps corresponding to the N first images and N second frequency maps corresponding to the N second images, respectively; and a third semantic segmentation unit, configured to perform second semantic segmentation on the N first images according to a preset condition based on the N first frequency maps and the N second frequency maps to obtain N second semantic segmentation maps corresponding to the N first images, wherein each of the N second semantic segmentation maps comprises the one or more first semantic regions and the one or more second semantic regions.
In a possible implementation manner, the third semantic segmentation unit is specifically configured to: compare the N first frequency maps with the N second frequency maps in one-to-one correspondence, and determine, as one or more second semantic regions, one or more regions for which the difference between the image frequency in the i-th first frequency map of the N first frequency maps and the image frequency in the corresponding i-th second frequency map of the N second frequency maps is greater than a second preset value, wherein each of the one or more regions is a region whose image frequency in the i-th first frequency map is smaller than the first preset value; and perform second semantic segmentation on the i-th first image corresponding to the i-th first frequency map according to the one or more second semantic regions to obtain a second semantic segmentation map corresponding to the i-th first image, wherein the second semantic segmentation map corresponding to the i-th first image comprises the one or more first semantic regions and the one or more second semantic regions, and i is an integer greater than or equal to 1 and less than or equal to N.
In one possible implementation, the apparatus further includes: a third obtaining unit, configured to obtain a first training sample set, where the first training sample set includes the N first images and the N second semantic segmentation maps; and the first training unit is used for training to obtain the semantic segmentation network by taking the N first images and the N second semantic segmentation maps as training input and taking the N second semantic segmentation maps as N labels.
In one possible implementation, the apparatus further includes: a fourth obtaining unit, configured to obtain a second training sample set, where the second training sample set includes the N first images, the N second semantic segmentation maps, and the N second images; and a second training unit, configured to train the super-resolution network with the N first images, the N second semantic segmentation maps, and the N second images as training input and the N second images as N labels.
In a possible implementation manner, the second training unit is specifically configured to: input the N first images and the N second semantic segmentation maps into an initial neural network, extract features of the N first images based on the N second semantic segmentation maps, and upsample the N first images after feature extraction to obtain N third images corresponding to the N first images; and, with the N second images as N labels, perform loss calculation on the N third images based on the N second semantic segmentation maps and correct one or more parameters of the initial neural network to obtain the super-resolution network.
In a possible implementation manner, the target image, the N first images, and the N second images are images obtained by shooting for a scene in a target geographic area; the semantic segmentation network is a semantic segmentation network aiming at the target geographic area, and the super-resolution network is a super-resolution network aiming at the target geographic area.
In a third aspect, an embodiment of the present application provides a terminal device, which includes a processor configured to support the terminal device in implementing the corresponding functions of the image processing method based on semantic segmentation provided in the first aspect. The terminal device may also include a memory, coupled to the processor, that stores program instructions and data necessary for the terminal device. The terminal device may also include a communication interface through which the terminal device communicates with other devices or a communication network.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the flow of the image processing method based on semantic segmentation in any one of the above first aspects is implemented.
In a fifth aspect, the present application provides a computer program, where the computer program includes instructions, and when the computer program is executed by a computer, the computer may execute the flow of the image processing method based on semantic segmentation in any one of the above first aspects.
In a sixth aspect, an embodiment of the present application provides a chip system, where the chip system includes the image processing apparatus based on semantic segmentation according to any one of the above first aspects, and is configured to implement the function related to the flow of the image processing method based on semantic segmentation according to any one of the above first aspects. In one possible design, the system-on-chip further includes a memory for storing program instructions and data necessary for the image processing method based on semantic segmentation. The chip system may be constituted by a chip, or may include a chip and other discrete devices.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required to be used in the embodiments or the background of the present application will be described below.
Fig. 1 is a schematic diagram of a super-resolution reconstruction process in the prior art.
Fig. 2 is a schematic diagram of a process of performing super-resolution image reconstruction based on an SFT-GAN network in the prior art.
Fig. 3 is a functional block diagram of a terminal device according to an embodiment of the present application.
Fig. 4 is a block diagram of a software structure of a terminal device according to an embodiment of the present application.
Fig. 5 is a schematic view of an application scenario of an image processing method based on semantic segmentation according to an embodiment of the present application.
Fig. 6a to 6c are a set of schematic interfaces provided by embodiments of the present application.
Fig. 7 is a schematic view of an application scenario of another image processing method based on semantic segmentation according to an embodiment of the present application.
Fig. 8 is a schematic flowchart of an image processing method based on semantic segmentation according to an embodiment of the present application.
Fig. 9 is a schematic diagram of a process of performing super-resolution image reconstruction based on semantic segmentation according to an embodiment of the present application.
Fig. 10 is a schematic diagram of a target semantic segmentation graph provided in an embodiment of the present application.
Fig. 11 is a schematic diagram of frequency resolution provided in an embodiment of the present application.
Fig. 12 is a schematic diagram of a process for obtaining a second semantic segmentation map according to an embodiment of the present application.
Fig. 13 is a schematic diagram of a training process of a super-resolution network according to an embodiment of the present application.
Fig. 14 is a schematic structural diagram of an image processing apparatus based on semantic segmentation according to an embodiment of the present application.
Fig. 15 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
The embodiments of the present application will be described below with reference to the drawings.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
As used in this specification, the terms "component," "module," "system," and the like are intended to refer to a computer-related entity, either hardware, firmware, a combination of hardware and software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a terminal device and the terminal device can be a component. One or more components can reside within a process and/or thread of execution and a component can be localized on one computer and/or distributed between 2 or more computers. In addition, these components can execute from various computer readable media having various data structures stored thereon. The components may communicate by way of local and/or remote processes such as in accordance with a signal having one or more data packets (e.g., data from two components interacting with another component in a local system, distributed system, and/or across a network such as the internet with other systems by way of the signal).
First, some terms in the present application are explained so as to be easily understood by those skilled in the art.
(1) Convolutional Neural Network (CNN): a feedforward neural network whose artificial neurons respond to a portion of the surrounding units within their coverage, and which performs well for large-scale image processing. It includes convolutional layers and pooling layers. A CNN is mainly used to recognize two-dimensional patterns that are invariant to displacement, scaling, and other forms of distortion, and part of this capability is realized by the pooling layers. Because the feature detection layers of a CNN learn from training data, explicit feature extraction is avoided when a CNN is used; learning from the training data is implicit. Moreover, because neurons on the same feature map share the same weights, the network can learn in parallel, which is a major advantage of convolutional networks over networks in which neurons are fully connected to one another. With its special structure of locally shared weights, a convolutional neural network has unique advantages in speech recognition and image processing; its layout is closer to that of an actual biological neural network, weight sharing reduces the complexity of the network, and in particular the ability to feed a multi-dimensional input image directly into the network avoids the complexity of data reconstruction during feature extraction and classification. The Super-Resolution Convolutional Neural Network (SRCNN) is also an important application of deep learning to super-resolution reconstruction.
(2) Super-Resolution (SR), or super-resolution reconstruction: improving the resolution of an original image by hardware or software means, reconstructing a corresponding high-resolution image from a single low-resolution image or a sequence of low-resolution images. Deep-learning-based SR is mainly Single Image Super-Resolution (SISR), which works from a single low-resolution image. Referring to fig. 1, fig. 1 is a schematic diagram of a super-resolution reconstruction process in the prior art. As shown in fig. 1, details in the low-resolution image can be reconstructed by a super-resolution convolutional neural network, yielding a fine high-resolution image with rich detail and texture. Super-resolution technology has important application value in fields such as surveillance equipment, satellite imagery, and medical imaging.
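To make the SISR idea concrete, a minimal network in the classic SRCNN layout (9x9 feature extraction, 1x1 non-linear mapping, 5x5 reconstruction after bicubic pre-upsampling) is sketched below; it only illustrates this general prior-art technique and is not the super-resolution network of the present application.

```python
import torch.nn as nn
import torch.nn.functional as F

class SRCNN(nn.Module):
    """Classic three-layer SRCNN layout: 9x9 feature extraction,
    1x1 non-linear mapping, 5x5 reconstruction, after bicubic pre-upsampling."""
    def __init__(self, channels: int = 3):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, 64, kernel_size=9, padding=4)
        self.conv2 = nn.Conv2d(64, 32, kernel_size=1)
        self.conv3 = nn.Conv2d(32, channels, kernel_size=5, padding=2)

    def forward(self, x, scale: int = 2):
        # bicubic pre-upsampling, then learned detail reconstruction
        x = F.interpolate(x, scale_factor=scale, mode="bicubic", align_corners=False)
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        return self.conv3(x)
```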
(3) Semantic segmentation, or called image semantic segmentation (semantic segmentation), is a very important field in computer vision, and refers to identifying an image at a pixel level, i.e. labeling an object class to which each pixel in the image belongs. The semantic segmentation can be applied to the fields of geographic information systems, unmanned vehicle driving, medical image analysis, robots and the like.
First, to facilitate understanding of the embodiments of the present application, the technical problems to be solved by the present application are further analyzed and presented. In the prior art, there are various technical solutions for image super-resolution; one commonly used solution is described below as an example.
The first scheme: recovering realistic texture in image super-resolution by deep spatial feature transform.
"Recovering Realistic Texture in Image Super-resolution by Deep Spatial Feature Transform" is a prior-art paper on super-resolution. Its core is a generative adversarial network based on spatial feature transform (SFT-GAN). Referring to fig. 2, fig. 2 is a schematic diagram of a prior-art process for reconstructing a super-resolution image based on an SFT-GAN network. As shown in fig. 2, the main idea of the SFT-GAN network is to input semantic segmentation probability maps as prior information into the SR network and to integrate the image information and the semantic information with a fusion module, thereby reducing interference between different semantics and promoting the corresponding enhancement of details under each semantic. As shown in fig. 2, the network includes a spatial feature transform layer (SFT layer), which applies an affine transformation to the intermediate features of the network; the transformation parameters are obtained by passing an additional prior condition (such as the semantic segmentation probability map shown in fig. 2) through several neural network layers. The spatial feature transform layer can be conveniently integrated into existing super-resolution networks such as SRResNet. To improve efficiency, the semantic segmentation probability map is not fed directly into the network; instead, as shown in fig. 2, it first passes through a condition network to obtain shared intermediate conditions, and these conditions are then "broadcast" to all SFT layers. As shown in fig. 2, a perceptual loss and an adversarial loss (GAN loss) are used together during training. In this way, the first scheme can correspondingly enhance the details under each semantic in the low-resolution image under the constraint and guidance of semantic information, generating more natural textures and making the visual effect of the high-resolution image obtained by super-resolution reconstruction more realistic.
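The affine modulation performed by the SFT layer can be sketched as follows; the channel widths, the two-layer condition branches, and the (gamma + 1) parameterisation are illustrative assumptions and do not reproduce the exact layer sizes of the SFT-GAN paper.

```python
import torch.nn as nn

class SFTLayer(nn.Module):
    """Spatial feature transform: the shared condition (derived from the
    semantic segmentation probability map) is mapped to a per-pixel scale
    (gamma) and shift (beta) that modulate the intermediate features."""
    def __init__(self, feat_channels: int = 64, cond_channels: int = 32):
        super().__init__()
        self.scale = nn.Sequential(
            nn.Conv2d(cond_channels, feat_channels, 1), nn.LeakyReLU(0.1),
            nn.Conv2d(feat_channels, feat_channels, 1))
        self.shift = nn.Sequential(
            nn.Conv2d(cond_channels, feat_channels, 1), nn.LeakyReLU(0.1),
            nn.Conv2d(feat_channels, feat_channels, 1))

    def forward(self, features, condition):
        gamma = self.scale(condition)            # per-pixel scale
        beta = self.shift(condition)             # per-pixel shift
        return features * (gamma + 1) + beta     # affine transform of the intermediate features
```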
The first scheme has the following disadvantage: the semantic segmentation classes of the SFT-GAN network mainly cover common categories such as sky, buildings, people, plants, animals, water surfaces, roads, bridges, traffic signals, and vehicles, so that texture can be added according to the semantics of different regions. However, during training of the SFT-GAN network, when high-frequency information is absent from a partial region of a low-resolution training image (i.e., the image frequency of that region is low and the region has almost no texture) while the corresponding region of the paired high-resolution image does contain high-frequency information (the two images have the same content and differ only in resolution, so the image frequency of that region of the high-resolution image is high and the region has abundant texture), the learning of the network's mapping is disturbed. Over-enhancing the details of such regions causes regions of other, similar actual scenes that have no texture at all, or only weak texture, to acquire pseudo-textures during super-resolution reconstruction. As a result, although the details of the obtained high-resolution image are enhanced, it contains false information and false textures that do not match the actual scene, which greatly reduces its authenticity and fails to meet the actual needs of users.
In summary, although the above scheme, like other super-resolution networks, can add texture under the guidance of different semantic information, its focus on enhancing detail makes it prone to generating pseudo-textures that do not exist in the actual scene when disturbed by very weak-texture regions. Therefore, to address the fact that current image super-resolution technology does not meet practical service requirements, the technical problems actually to be solved by the present application include the following: on existing terminal devices, realistically restore the details and textures of all parts of a low-resolution image, and guarantee that the high-resolution image is realistic and natural and conforms to the actual scene while the resolution is improved, thereby improving the user's photographing experience.
Referring to fig. 3, fig. 3 is a functional block diagram of a terminal device according to an embodiment of the present disclosure. Alternatively, in one embodiment, the terminal device 100 may be configured in a fully or partially automatic shooting mode. For example, the terminal device 100 may be in a timed continuous automatic shooting mode, or an automatic shooting mode in which shooting is performed when a target object (e.g., a target building, a human face, etc.) set in advance is detected within a shooting range according to a computer instruction, or the like. When the terminal device 100 is in the automatic shooting mode, the terminal device 100 may be set to operate without interaction with a person.
The following specifically describes the embodiment by taking the terminal device 100 as an example. It should be understood that terminal device 100 may have more or fewer components than shown, may combine two or more components, or may have a different configuration of components. The various components shown in the figures may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing and/or application specific integrated circuits.
The terminal device 100 may include: the mobile terminal includes a processor 110, an external memory interface 120, an internal memory 121, a Universal Serial Bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a Subscriber Identity Module (SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It is to be understood that the illustrated structure of the embodiment of the present application does not constitute a specific limitation to the terminal device 100. In other embodiments of the present application, terminal device 100 may include more or fewer components than shown in fig. 3, or some components may be combined, some components may be split, or a different arrangement of components may be used, etc. The components shown in fig. 3 may be implemented in hardware, software, or a combination of software and hardware.
Processor 110 may include one or more processing units, such as: the processor 110 may include an Application Processor (AP), a modem processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a memory, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural-Network Processing Unit (NPU), etc. The different processing units may be separate devices or may be integrated into one or more processors.
The controller may be a neural center and a command center of the terminal device 100, among others. The controller can generate an operation control signal according to the instruction operation code and the timing signal to complete the control of instruction fetching and instruction execution.
A memory may also be provided in processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 may be a cache memory. The memory may hold instructions or data that have just been used or recycled by the processor 110. If the processor 110 needs to reuse the instruction or data, it can be called directly from the memory. Repeated accesses of instructions or data are avoided, and the waiting time of the processor 110 is reduced, so that the operating efficiency of the system can be greatly improved.
In some embodiments, processor 110 may include one or more interfaces. The interface may include an integrated circuit (I2C) interface, an integrated circuit built-in audio (I2S) interface, a Pulse Code Modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a Mobile Industry Processor Interface (MIPI), a general-purpose input/output (GPIO) interface, a Subscriber Identity Module (SIM) interface, and/or a Universal Serial Bus (USB) interface, etc.
It should be understood that the interface connection relationship between the modules illustrated in the embodiment of the present application is only an exemplary illustration, and does not constitute a limitation on the structure of the terminal device 100. In other embodiments of the present application, the terminal device 100 may also adopt a different interface connection manner or a combination of a plurality of interface connection manners than those in the above embodiments.
The charging management module 140 is configured to receive charging input from a charger. The charger may be a wireless charger or a wired charger.
The power management module 141 is used to connect the battery 142, the charging management module 140 and the processor 110. The power management module 141 receives input from the battery 142 and/or the charge management module 140 and provides power to the processor 110, the internal memory 121, the external memory, the display 194, the camera 193, the wireless communication module 160, and the like.
The wireless communication function of the terminal device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, and the like.
The terminal device 100 implements a display function by the GPU, the display screen 194, and the application processor. The GPU is a microprocessor for image processing, and is connected to the display screen 194 and an application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
The display screen 194 is used to display images, video, and the like. The display screen 194 includes a display panel. The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like. In some embodiments, the terminal device 100 may include 1 or more display screens 194.
The terminal device 100 may implement a shooting function through the ISP, the camera 193, the video codec, the GPU, the display screen 194, the application processor, and the like. In some embodiments, the terminal device 100 may include one or more cameras 193.
The ISP is used to process the data fed back by the camera 193. For example, when a photo is taken, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing and converting into an image visible to naked eyes. The ISP can also carry out algorithm optimization on the noise, brightness, contrast, human face skin color and the like of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In some embodiments, the ISP may be provided in camera 193.
The camera 193 is used to capture still images or video. The object generates an optical image through the lens and projects the optical image to the photosensitive element. The photosensitive element may be a Charge Coupled Device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, which is then passed to the ISP where it is converted into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard RGB or YUV format. In this embodiment of the present application, a target image may be acquired by the camera 193, and the resolution of the target image may be low (for example, when the terminal device 100 only supports image shooting with at most 3 million pixels, the resolution of the target image acquired by the camera during shooting is low, and the target image may appear less clear to the user). In some embodiments, the target image may also be acquired by more than one camera 193, and the like, which is not specifically limited in this embodiment. In some embodiments, the processor 110 may acquire a target image acquired by the camera 193, and then input the target image into a pre-trained semantic segmentation network to obtain a target semantic segmentation map (which may include, for example, one or more first semantic regions and one or more second semantic regions) of the target image; then, the target image and the target semantic segmentation map are input into a super-resolution network obtained through pre-training, and super-resolution processing is respectively performed on different regions of the target image under the guidance and constraint of semantic information (for example, different textures are added to the different semantic regions), so as to obtain a super-resolution image corresponding to the target image, where the super-resolution image has rich texture that is real, natural, and consistent with the actual scene. The camera 193 may be located on the front side of the terminal device 100, for example, above the touch screen, or may be located at another position, for example, on the back side of the terminal device. In addition, in some embodiments, the camera 193 may further include a camera for capturing images required for face recognition, such as an infrared camera or another camera. The camera for collecting the images required for face recognition is generally located on the front side of the terminal device 100, for example, above the touch screen, or may be located at another position, for example, on the back side of the terminal device 100. In some embodiments, the terminal device 100 may also include other cameras. The terminal device 100 may also include a dot matrix emitter (not shown in fig. 3) for emitting light.
The digital signal processor is used for processing digital signals, and can process digital image signals and other digital signals. For example, when the terminal device 100 selects a frequency point, the digital signal processor is used to perform fourier transform or the like on the frequency point energy.
Video codecs are used to compress or decompress digital video. The terminal device 100 may support one or more video codecs. In this way, the terminal device 100 can play or record video in a plurality of encoding formats, such as: moving Picture Experts Group (MPEG) 1, MPEG2, MPEG3, MPEG4, and the like.
The NPU is a neural-network (NN) computing processor that processes input information quickly by using a biological neural network structure, for example, by using a transfer mode between neurons of a human brain, and can also learn by itself continuously. The NPU can implement applications such as intelligent recognition of the terminal device 100, for example: image processing such as image recognition, face recognition, voice recognition, text understanding, semantic segmentation, super-resolution reconstruction and the like.
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to extend the storage capability of the terminal device 100. The external memory card communicates with the processor 110 through the external memory interface 120 to implement a data storage function. For example, files such as music, video, photos, etc. are saved in an external memory card.
The internal memory 121 may be used to store computer-executable program code, which includes instructions. The processor 110 executes various functional applications of the terminal device 100 and data processing by executing instructions stored in the internal memory 121. The internal memory 121 may include a program storage area and a data storage area. The storage program area may store an operating system, applications required by at least one function (e.g., a face recognition function, a video recording function, a photographing function, an image processing function such as a semantic segmentation function and a super-resolution reconstruction function), and the like. The storage data area may store data created during use of the terminal device 100, and the like. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a nonvolatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (UFS), and the like.
The terminal device 100 may implement an audio function through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the earphone interface 170D, and the application processor. Such as music playing, recording, etc.
The audio module 170 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal.
The speaker 170A, also called a "horn", is used to convert the audio electrical signal into an acoustic signal.
The receiver 170B, also called "earpiece", is used to convert the electrical audio signal into an acoustic signal.
The microphone 170C, also referred to as a "microphone," is used to convert sound signals into electrical signals.
The headphone interface 170D is used to connect a wired headphone. The headset interface 170D may be the USB interface 130, or may be a 3.5 mm Open Mobile Terminal Platform (OMTP) standard interface, or a Cellular Telecommunications Industry Association of the USA (CTIA) standard interface.
The pressure sensor 180A is used for sensing a pressure signal, and converting the pressure signal into an electrical signal. In some embodiments, the pressure sensor 180A may be disposed on the display screen 194. The pressure sensor 180A can be of a wide variety, such as a resistive pressure sensor, an inductive pressure sensor, a capacitive pressure sensor, and the like.
The gyro sensor 180B may be used to determine the motion attitude of the terminal device 100. In some embodiments, the angular velocity of terminal device 100 about three axes (i.e., x, y, and z axes) may be determined by gyroscope sensor 180B.
The proximity light sensor 180G may include, for example, a Light Emitting Diode (LED) and a light detector, such as a photodiode. The light emitting diode may be an infrared light emitting diode.
The ambient light sensor 180L is used to sense the ambient light level. The terminal device 100 may adaptively adjust the brightness of the display screen 194 according to the perceived ambient light level. The ambient light sensor 180L may also be used to automatically adjust the white balance when taking a picture.
The fingerprint sensor 180H is used to collect a fingerprint. The terminal device 100 can utilize the collected fingerprint characteristics to realize fingerprint unlocking, access to an application lock, fingerprint photographing, fingerprint incoming call answering and the like. The fingerprint sensor 180H may be disposed below the touch screen, the terminal device 100 may receive a touch operation of a user on the touch screen in an area corresponding to the fingerprint sensor, and the terminal device 100 may collect fingerprint information of a finger of the user in response to the touch operation, so as to implement a related function.
The temperature sensor 180J is used to detect temperature. In some embodiments, the terminal device 100 executes a temperature processing policy using the temperature detected by the temperature sensor 180J.
The touch sensor 180K is also referred to as a "touch panel". The touch sensor 180K may be disposed on the display screen 194, and the touch sensor 180K and the display screen 194 form a touch screen, which is also called a "touch screen". The touch sensor 180K is used to detect a touch operation applied thereto or nearby. The touch sensor can communicate the detected touch operation to the application processor to determine the touch event type. Visual output associated with the touch operation may be provided through the display screen 194. In other embodiments, the touch sensor 180K may be disposed on the surface of the terminal device 100, different from the position of the display screen 194.
The keys 190 include a power-on key, a volume key, and the like. The keys 190 may be mechanical keys. Or may be touch keys. The terminal device 100 may receive a key input, and generate a key signal input related to user setting and function control of the terminal device 100.
Indicator 192 may be an indicator light that may be used to indicate a state of charge, a change in charge, or a message, missed call, notification, etc.
The SIM card interface 195 is used to connect a SIM card. The SIM card can be brought into and out of contact with the terminal device 100 by being inserted into the SIM card interface 195 or being pulled out of the SIM card interface 195. In some embodiments, the terminal device 100 employs eSIM, namely: an embedded SIM card. The eSIM card may be embedded in the terminal device 100 and cannot be separated from the terminal device 100.
The terminal device 100 may be a video camera, a camcorder, a smart phone, a smart wearable device, a tablet computer, a laptop computer, a desktop computer, and the like, which have the above functions, and the embodiment of the present application is not particularly limited thereto.
The software system of the terminal device 100 may adopt a hierarchical architecture, an event-driven architecture, a micro-core architecture, a micro-service architecture, or a cloud architecture. The embodiment of the present application takes an Android system with a layered architecture as an example, and exemplarily illustrates a software structure of the terminal device 100.
Referring to fig. 4, fig. 4 is a block diagram of a software structure of a terminal device according to an embodiment of the present application. The layered architecture divides the software into several layers, each layer having a clear role and division of labor. The layers communicate with each other through a software interface. In some embodiments, the Android system is divided into four layers, an application layer, an application framework layer, an Android runtime (Android runtime) and system library, and a kernel layer from top to bottom.
The application layer may include a series of application packages.
As shown in fig. 4, the application package may include applications (also referred to as apps) such as camera, gallery, calendar, phone call, map, navigation, WLAN, bluetooth, music, video, short message, etc. The application package may further include an image processing application related to the present application, and the image processing application may be used to process an original image by using the image processing method based on semantic segmentation, so as to obtain a high-resolution image with rich details and real, natural texture.
The application framework layer provides an Application Programming Interface (API) and a programming framework for the application program of the application layer. The application framework layer includes a number of predefined functions.
As shown in FIG. 4, the application framework layers may include a window manager, content provider, view system, phone manager, resource manager, notification manager, and the like.
The window manager is used for managing window programs. The window manager can obtain the size of the display screen, judge whether a status bar exists, lock the screen, intercept the screen and the like.
The content provider is used to store and retrieve data and make it accessible to applications. The data may include video, images, audio, calls made and received, browsing history and bookmarks, phone books, etc.
The view system includes visual controls, such as controls to display text, controls to display pictures, and the like. The view system may be used to build applications. The display interface may be composed of one or more views. For example, the display interface including the short message notification icon may include a view for displaying text and a view for displaying pictures. For example, in some embodiments, a photographing interface including a related high-definition photographing control may be provided; by clicking the high-definition photographing control, an original low-resolution image may be acquired, and a series of processing on the original low-resolution image (including semantic segmentation, super-resolution processing, and the like) may be completed by using the image processing method based on semantic segmentation in the present application, so as to obtain a high-resolution image with rich details and real, natural texture.
The phone manager is used to provide the communication function of the terminal device 100. Such as management of call status (including on, off, etc.).
The resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, and the like.
The notification manager enables the application to display notification information in the status bar, can be used to convey notification-type messages, can disappear automatically after a short dwell, and does not require user interaction. Such as a notification manager used to inform download completion, message alerts, etc. The notification manager may also be a notification that appears in the form of a chart or scroll bar text at the top status bar of the system, such as a notification of a background running application, or a notification that appears on the screen in the form of a dialog interface. For example, text information is prompted in the status bar, a prompt tone is given, the terminal device vibrates, an indicator light flickers, and the like. For example, when performing high-definition photographing according to the present application, the user may be prompted by text information on a photographing end interface that the photographing is completed, and a high-resolution image obtained by processing an original low-resolution image by using an image processing method based on semantic segmentation in the present application may be generated. The high-definition photographing can also be performed, but when the memory of the terminal device 100 is insufficient, the user is prompted through the corresponding text information that the memory is insufficient, the photographing cannot be performed, and the like.
The Android Runtime comprises a core library and a virtual machine. The Android runtime is responsible for scheduling and managing an Android system.
The core library includes two parts: one part includes the functions that the Java language needs to call, and the other part is the core library of Android.
The application layer and the application framework layer run in the virtual machine. The virtual machine executes the Java files of the application layer and the application framework layer as binary files. The virtual machine is used to perform functions such as object life cycle management, stack management, thread management, security and exception management, and garbage collection.
The system library may include a plurality of functional modules. For example: surface managers (surface managers), Media Libraries (Media Libraries), three-dimensional graphics processing Libraries (e.g., OpenGL ES), 2D graphics engines (e.g., SGL), and the like.
The surface manager is used to manage the display subsystem and provide fusion of 2D and 3D layers for multiple applications.
The media library supports a variety of commonly used audio, video format playback and recording, and still image files, among others. The media library may support a variety of audio-video encoding formats, such as MPEG4, h.264, MP3, AAC, AMR, JPG, PNG, and the like. The video formats referred to in this application may be, for example, RM, RMVB, MOV, MTV, AVI, AMV, DMV, FLV, etc.
The three-dimensional graphic processing library is used for realizing three-dimensional graphic drawing, image rendering, synthesis, layer processing and the like.
The 2D graphics engine is a drawing engine for 2D drawing.
The kernel layer is a layer between hardware and software. The inner core layer at least comprises a display driver, a camera driver, an audio driver and a sensor driver.
To facilitate understanding of the embodiments of the present application, the following exemplarily lists application scenarios to which the image processing method based on semantic segmentation in the present application is applicable, which may include the following two scenarios.
In a first scene, terminal equipment acquires a target image through a camera and finally generates a super-resolution image of the target image.
Referring to fig. 5, fig. 5 is a schematic view of an application scenario of an image processing method based on semantic segmentation according to an embodiment of the present application, where the application scenario is a terminal device (for example, a smart phone in fig. 5). And the terminal device can comprise a relevant shooting module, a display, a processor and the like. The shooting module, the display and the processor can perform data transmission through a system bus. The shooting module may include cameras located on the front and/or back of the terminal device, and the cameras may convert captured light source signals into digital signals to complete acquisition of target images (i.e., to complete acquisition of original low-resolution images). And then transmitting the acquired target image to a processor through the system bus. The processor processes the target image by using the image processing method based on semantic segmentation in the application according to the acquired target image, for example, a series of processes including semantic segmentation, super-resolution reconstruction and the like are performed, so that the finally processed super-resolution image has real and natural rich textures, and meets the actual requirements of users.
In this embodiment of the present application, when a user wants to take a picture, reference may be made to fig. 6a, fig. 6b, and fig. 6c for the operation process of the terminal device, where fig. 6a to fig. 6c are a set of interface diagrams provided in this embodiment of the present application. As shown in fig. 6a, the terminal device displays a photographing interface 601, where the photographing interface 601 may include a photographing control 602, a front and back photographing control 603, an image library control 604, a photographing mode control group 605 (for example, including a large aperture control 605A, a night scene control 605B, a photographing control 605C, a recording control 605D, a portrait control 605E, and a plurality of controls 605F), a setting control 606, and other controls (for example, a flash control and a magnification control). When the user wants to take a picture, the terminal device can be placed in a photographing mode by clicking the photographing control 605C, and photographing can then be started through an input operation 607 (for example, clicking). Optionally, as shown in fig. 6a and 6b, the photographing interface 601 may further include an image quality enhancement control 608. As shown in fig. 6b, when the user wants to capture an image with higher definition and richer details, the user may start the image quality enhancement function through an input operation 609 (for example, clicking), so that the original low-resolution image (that is, the target image) acquired by the camera during capturing is processed based on the image processing method based on semantic segmentation provided in the present application, and a super-resolution image with real and natural rich texture is finally obtained. As shown in fig. 6a, the image captured without the image quality enhancement function is low in resolution, blurred, and short of details, while as shown in fig. 6b, the image captured after the image quality enhancement function is turned on is high in resolution, sharp, and rich in details. Alternatively, the resolution of the image captured after the image quality enhancement function is turned on may be the same as the resolution of the image captured without the image quality enhancement function, but the former image is richer in details, and this is not limited in this embodiment of the present application. Optionally, when taking a picture in the image quality enhancement mode, the original low-resolution image collected by the camera may provide a preview in the foreground, and the preview is displayed on the display of the terminal device. Optionally, when photographing is performed in the image quality enhancement mode, the terminal device may continuously acquire the original low-resolution image during the photographing process based on the operation of the user, and perform synchronous real-time processing on the original low-resolution image, that is, the terminal device directly generates a super-resolution image with real and natural texture each time the photographing is finished. In addition, optionally, the terminal device may also perform only the capturing operation during the photographing process and perform the related image processing operations (for example, operations including semantic segmentation and super-resolution reconstruction) asynchronously; when the photographing is finished, the terminal device may save each original low-resolution image captured during the photographing process.
Then, the terminal device may respond to the input operation of the user for the relevant control, and process the original low-resolution image to obtain a corresponding super-resolution image.
Alternatively, the user may set the resolution of the originally acquired low-resolution image (that is, the target image) by performing operations related to the setting control 606 shown in fig. 6a and 6b. As shown in fig. 6c, the terminal device displays a photo resolution setting interface 610, and the photo resolution setting interface 610 may include a plurality of photo resolution controls (for example, the [4:3] 40MP control 611, the [4:3] 10MP (recommended) control 612, the [1:1] 7MP control 613, the [1:1] 7MP control 614, and the [full screen] 6MP control 614 shown in fig. 6c, where [4:3], [1:1], and [full screen] are aspect ratios of photos, 40MP is 40 million pixels, 10MP is 10 million pixels, 7MP is 7 million pixels, and 6MP is 6 million pixels; obviously, the resolution of 40MP is much greater than the resolution of 6MP), and so on. The user can set a desired resolution by clicking the corresponding photo resolution control; for example, as shown in fig. 6c, the user can set the resolution of the original image captured by the camera to 10MP through an input operation 615 (for example, clicking). It can be understood that the resolution of a photo is usually limited by the resolution that the terminal device can provide; as shown in fig. 6c, the terminal device can provide a resolution of at most 40MP, so that the user cannot obtain an image with a resolution higher than 40MP without turning on the image quality enhancement function, which seriously affects the photographing experience of the user. With the image quality enhancement function turned on, however, the user can obtain an image that has a resolution higher than 40MP and rich, real, and natural texture.
Alternatively, when the image quality enhancement function is turned on for photographing, the terminal device may store the unprocessed original low-resolution image and the super-resolution image obtained by processing the original low-resolution image into the image library at the same time, and the user may view all the photographed images (including, for example, a plurality of unprocessed original low-resolution images and a plurality of super-resolution images) by clicking the image library control 604. The user can also share, edit, collect, and delete the original low-resolution image and the super-resolution image respectively through a related sharing control, an editing control, a collection control, a deletion control (not shown in the figure), and the like. For example, if a user finds that the super-resolution image has a good effect, not only is clear and rich in details, but also the texture in the super-resolution image is real and natural and conforms to an actual scene after viewing the unprocessed original low-resolution image and the super-resolution image, the user may select to delete the original low-resolution image, and the like, which is not specifically limited in this application.
Optionally, in this embodiment of the application, when a developer wants to take a photo to test an image processing method based on semantic segmentation in the application, the developer may also refer to fig. 6a, 6b, and 6c for an operation process of the terminal device, which is not described herein again. Developers can continuously optimize the semantic segmentation network and the super-resolution network in the application according to the obtained super-resolution image, and the like, so that a better photographing effect is realized.
As described above, the terminal device may be a camera, a smart phone, a smart wearable device, a tablet computer, a laptop computer, a desktop computer, and the like, which have the functions of image acquisition, image processing (including semantic segmentation, super-resolution reconstruction, and the like), and display, and this is not limited in this embodiment.
And in a second scene, the terminal equipment is connected with the computing equipment, the target image which is acquired by the terminal equipment and sent to the computing equipment is processed by the computing equipment, and finally the super-resolution image of the target image is generated.
Referring to fig. 7, fig. 7 is a schematic view of an application scenario of another semantic segmentation based image processing method according to an embodiment of the present application, where the application scenario includes a terminal device (for example, a smart phone in fig. 7) and a computing device (for example, a desktop computer in fig. 7). The terminal device and the computing device can perform data transmission through wireless communication modes such as Bluetooth, Wi-Fi or a mobile network or wired communication modes such as a data line. The terminal device may include a related shooting module, a display, a processor, and the like. The shooting module, the display and the processor can perform data transmission through a system bus. The shooting module may include cameras located on the front and/or back of the terminal device, and the cameras may convert captured light source signals into digital signals to complete acquisition of target images (i.e., to complete acquisition of original low-resolution images).
The terminal device may then send the acquired original low-resolution image to the computing device through the wireless/wired communication described above. The computing device processes the acquired original low-resolution image by using the image processing method based on semantic segmentation in the present application, which may include a series of processes such as semantic segmentation and super-resolution reconstruction as shown in fig. 7. Finally, a high-resolution image satisfying the user's requirements is generated; for example, the super-resolution image shown in fig. 7 has a high resolution and real, natural, rich textures. For another example, a super-resolution image having a resolution equal to that of the original low-resolution image but having real and natural rich texture may be generated. Further, the computing device can also store the processed super-resolution image locally on the computing device, and choose to send the super-resolution image to the terminal device or other devices. Further, the terminal device may also choose to send a plurality of original low-resolution images to the computing device, where the plurality of original low-resolution images may be a plurality of images that were captured by the terminal device in advance and are stored locally on the terminal device. Then, the computing device can process the multiple original low-resolution images simultaneously or sequentially by using the image processing method based on semantic segmentation in the present application, and finally generate multiple corresponding super-resolution images with real and natural rich textures, thereby meeting the actual requirements of users.
As described above, the terminal device may be a camera, a smart phone, a smart wearable device, a tablet computer, a laptop computer, a desktop computer, and the like, which have the above functions, and this is not particularly limited in this embodiment of the application; the computing device may be a smart phone, a smart wearable device, a tablet computer, a laptop computer, a desktop computer, and the like, which have the above functions, and this is not particularly limited in this embodiment of the application.
Referring to fig. 8, fig. 8 is a flowchart illustrating an image processing method based on semantic segmentation according to an embodiment of the present application, where the method is applicable to the application scenarios and the system architectures described in fig. 5 or fig. 7, and is specifically applicable to the terminal device 100 of fig. 3. The following description will be given taking the processor 110 whose execution main body is inside the terminal device 100 in fig. 3 as an example, with reference to fig. 8. The method may include the following steps S801 to S803.
In step S801, a target image is acquired.
Specifically, the terminal device starts the camera to shoot a scene in the visual field of the user, and the camera can convert the captured light source signal into a digital signal, so that the acquisition of a target image is completed. The processor inside the terminal device may then obtain the target image; the resolution of the target image depends on the resolution that can be supported by the terminal device (e.g. may be 20MP, 10MP, 7MP, 6MP, etc.), and the resolution of the target image may be low (e.g. 7MP, 6MP, or even 4MP, etc.). Optionally, the target image may be acquired by a front camera of the terminal device, or may be acquired by a rear camera of the terminal device; generally, the maximum resolution supported by the front camera is smaller than the maximum resolution supported by the rear camera, which is not specifically limited in this embodiment of the present application. Optionally, the geographic location of the terminal device may be determined by a global positioning system inside the terminal device, and further, geographic area information of the target image may be determined, for example, a target geographic area to which the target image belongs. The target geographic area may be an area of a park, a mall, a school, a street, a stadium, a commercial street, a tourist attraction, and the like, which is previously divided by the developer, and this is not particularly limited in this embodiment of the present application. For example, on the basis of determining the geographic position and orientation of the terminal device, the target geographic area to which the target image belongs may be determined as area A, that is, the scene content included in the target image is determined to be the scene content of area A. Optionally, the target image may also be subjected to image recognition by a pre-trained model (for example, a geographic area recognition model trained according to a large number of pictures taken for a plurality of geographic areas), a target geographic area to which the target image belongs is determined according to scene content included in the target image, and the like, which is not specifically limited in this embodiment of the present application.
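As an illustrative aid only, the following minimal sketch shows how a GPS fix might be mapped to one of the pre-divided geographic areas mentioned above; the area table, the coordinates, and the matching rule are assumptions for illustration and are not part of the patent.

AREA_TABLE = [
    # (area name, lat_min, lat_max, lon_min, lon_max) - hypothetical values
    ("school_A", 39.990, 40.000, 116.300, 116.320),
    ("park_B",   39.970, 39.985, 116.330, 116.350),
]

def locate_target_area(lat: float, lon: float):
    """Return the name of the pre-divided area containing the GPS fix, if any."""
    for name, lat_min, lat_max, lon_min, lon_max in AREA_TABLE:
        if lat_min <= lat <= lat_max and lon_min <= lon <= lon_max:
            return name
    return None  # fall back, e.g., to a scene-recognition model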
Step S802, inputting the target image into a semantic segmentation network to obtain a target semantic segmentation map of the target image, wherein the target semantic segmentation map comprises K first semantic regions and P second semantic regions.
Specifically, after the processor in the terminal device obtains the target image, the processor may transmit the target image to the semantic segmentation network obtained by pre-training, so as to obtain the target semantic segmentation map of the target image. The target semantic segmentation map may include K first semantic regions and P second semantic regions. Optionally, each of the K first semantic regions may be a region segmented according to a preset semantic category (for example, common semantic categories such as sky, buildings, people, plants, animals, water surface, roads, bridges, vehicles, traffic signals, and so on). Each of the P second semantic regions may be a region in which the image frequency in the target image is less than a first preset value and which is used for texture addition in the super-resolution processing, that is, a weak texture region whose texture is difficult to recover. The K first semantic regions and the P second semantic regions are used for guiding and constraining the subsequent super-resolution reconstruction process of the target image, and interference among the semantic regions is reduced, so that the details and textures of the semantic regions are enhanced to different degrees, pseudo textures are avoided, and the obtained super-resolution image has real and natural rich textures. Wherein K, P may each be an integer greater than or equal to 1. Optionally, the target semantic segmentation graph may also include only K first semantic regions, and may also include only P second semantic regions, which is not specifically limited in this embodiment of the application.
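As a rough sketch only (the class indices, names, and numpy-based representation below are illustrative assumptions, not the patent's concrete data layout), the target semantic segmentation map can be thought of as a per-pixel label map from which the K first semantic regions and the P second semantic regions are recovered as masks:

import numpy as np

# Hypothetical label convention: small indices are preset semantic categories
# (first semantic regions); indices from 100 upward mark weak-texture categories
# (second semantic regions).
PRESET_CLASSES = {0: "sky", 1: "building", 2: "person", 3: "plant"}
WEAK_TEXTURE_CLASSES = {100: "weak_texture_region"}

def split_semantic_regions(label_map: np.ndarray):
    """Split a per-pixel label map into first / second semantic region masks."""
    first = {c: label_map == c for c in PRESET_CLASSES if (label_map == c).any()}
    second = {c: label_map == c for c in WEAK_TEXTURE_CLASSES if (label_map == c).any()}
    return first, second   # len(first) == K, len(second) == P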
Optionally, referring to fig. 9, fig. 9 is a schematic diagram of a process of performing super-resolution image reconstruction based on semantic segmentation according to an embodiment of the present application. As shown in fig. 9, a corresponding semantic segmentation network may be determined according to the geographic area information of the target image; for example, if the target geographic area to which the target image belongs is school a (that is, the target image is an image captured by shooting a scene in school a), the semantic segmentation network corresponding to school a may be selected to perform semantic segmentation on the target image; for another example, if the target geographic area to which the target image belongs is a B park (that is, the target image is an image captured for a scene in the B park), the corresponding semantic segmentation network for the B park may be selected to perform semantic segmentation on the target image. Therefore, the acquired target images of all the geographic regions can be subjected to semantic segmentation based on the semantic segmentation networks obtained through pre-training for the plurality of different geographic regions, so that the efficiency and the accuracy of the semantic segmentation are ensured, the quality of subsequent super-resolution reconstruction based on the semantic segmentation is improved, the quality of the obtained super-resolution images is ensured, and the photographing experience of the user is ensured. As shown in fig. 9, the target semantic segmentation map obtained from the target image through the semantic segmentation network may include a plurality of semantic regions; for details, refer to fig. 10, where fig. 10 is a schematic diagram of the target semantic segmentation map provided in the embodiment of the present application. As shown in fig. 10, for the target image shown in fig. 9, the target semantic segmentation map obtained through the semantic segmentation network may include a plurality of first semantic regions (e.g., the first semantic region 1 (sky), the first semantic region 2 (building), and the first semantic region 3 (plant) shown in fig. 10) and a second semantic region (e.g., the second semantic region 1 shown in fig. 10).
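A minimal sketch of this geographic-area-based model selection is given below; the dictionary keys, file names, and loading callable are hypothetical and only illustrate the idea of keeping one pre-trained semantic segmentation network (and, analogously, one super-resolution network) per geographic area.

# Hypothetical registry of per-area model files.
SEGMENTATION_MODELS = {"school_A": "seg_school_a.pt", "park_B": "seg_park_b.pt"}

def select_segmentation_network(target_area: str, load_fn):
    """Pick the semantic segmentation network trained for the target geographic area."""
    path = SEGMENTATION_MODELS.get(target_area)
    if path is None:
        raise ValueError(f"no segmentation network trained for area: {target_area}")
    return load_fn(path)   # e.g., torch.load under the stated assumptions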
Optionally, the training process of the semantic segmentation network may include the following steps S11 to S15.
In step S11, a first image set and a second image set are acquired.
Specifically, the terminal device acquires a first image set and a second image set, where the first image set may include N first images (i.e., N low-resolution images), the second image set may include N second images (i.e., N high-resolution images), the N first images and the N second images correspond to each other one by one, and the resolution of each second image in the N second images is greater than the resolution of each corresponding first image in the N first images, i.e., each second image is a high-definition version of its corresponding first image, and the image contents of the two images are the same; N is an integer greater than or equal to 1. The second image and the first image corresponding to each other one by one may be obtained by the terminal device shooting the same scene in the target geographic area at different resolutions. Optionally, the terminal device may capture a plurality of second images of a plurality of scenes in the target geographic area at a higher resolution, and perform downsampling processing on the plurality of second images through the terminal device or other devices to obtain a plurality of first images with a lower resolution, and the like, which is not specifically limited in this embodiment of the application.
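For the optional downsampling route mentioned above, a brief sketch is shown below; the use of PIL and bicubic downsampling with a fixed scale factor is an assumption for illustration, not the patent's prescribed procedure.

from PIL import Image

def make_first_image(second_image_path: str, scale: int = 4) -> Image.Image:
    """Produce a low-resolution first image by downsampling a high-resolution second image."""
    hr = Image.open(second_image_path).convert("RGB")
    return hr.resize((hr.width // scale, hr.height // scale), Image.BICUBIC)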
Step S12, performing first semantic segmentation on the N first images according to preset semantic categories to obtain N first semantic segmentation maps corresponding to the N first images.
Specifically, different regions in each of the N first images can be framed and labeled according to preset semantic categories (e.g., common semantic categories such as sky, buildings, people, plants, animals, water surface, roads, bridges, vehicles, traffic signals, and the like) by using existing semantic labeling tools. For example, one or more sky regions, one or more building regions, one or more person regions, one or more plant regions, and the like in a first image are selected in sequence and labeled with the corresponding semantics (for example, a sky region is labeled with semantic information of the sky, a building region is labeled with semantic information of a building, and a plant region is labeled with semantic information of plants), which is not described herein again. In this way, the first semantic segmentation of the N first images is completed, and N first semantic segmentation maps corresponding to the N first images are obtained. Each of the N first semantic segmentation maps includes one or more first semantic regions segmented according to the preset semantic categories.
Step S13, performing frequency analysis on the N first images and the N second images to obtain N first frequency maps corresponding to the N first images and N second frequency maps corresponding to the N second images, respectively.
Specifically, the terminal device performs frequency analysis on each of the N first images, and performs frequency analysis on each of the N second images. Thus, N first frequency maps corresponding to the N first images and N second frequency maps corresponding to the N second images are obtained respectively. Referring to fig. 11, fig. 11 is a schematic diagram of frequency analysis provided in an embodiment of the present application. As shown in fig. 11, a first frequency map corresponding to the high and low frequencies of an image may be obtained by performing frequency analysis on a first image through a convolutional neural network, and optionally, a second frequency map corresponding to the high and low frequencies of an image may also be obtained by performing frequency analysis on a second image through a convolutional neural network (not shown in fig. 11). Optionally, the frequency analysis may also be performed by using methods such as edge detection and wavelet analysis (Wavelet), which is not specifically limited in this embodiment of the application.
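Purely as an illustration of the frequency analysis step (the patent mentions convolutional networks, edge detection, and wavelet analysis; the Laplacian high-pass kernel and block size below are one assumed choice), local image frequency can be estimated roughly as follows:

import numpy as np
from scipy.ndimage import convolve

LAPLACIAN = np.array([[0, 1, 0],
                      [1, -4, 1],
                      [0, 1, 0]], dtype=np.float32)

def frequency_map(gray: np.ndarray, block: int = 16) -> np.ndarray:
    """Mean absolute high-pass response over non-overlapping blocks as a frequency proxy."""
    high_pass = np.abs(convolve(gray.astype(np.float32), LAPLACIAN))
    h = gray.shape[0] - gray.shape[0] % block
    w = gray.shape[1] - gray.shape[1] % block
    blocks = high_pass[:h, :w].reshape(h // block, block, w // block, block)
    return blocks.mean(axis=(1, 3))   # higher value -> more high-frequency texture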
Step S14, performing second semantic segmentation on the N first images according to preset conditions based on the N first frequency maps and the N second frequency maps, to obtain N second semantic segmentation maps corresponding to the N first images.
Specifically, the N first frequency maps and the N second frequency maps are compared in a one-to-one correspondence manner, one or more regions in which the difference value between the image frequency in the ith first frequency map of the N first frequency maps and the image frequency in the ith second frequency map of the N corresponding second frequency maps is greater than a second preset value are determined to be one or more second semantic regions, and each of the one or more regions is a region in which the image frequency in the ith first frequency map is less than the first preset value. Each of the one or more regions may also be a region in the ith first frequency map where the image frequency is less than other preset values.
Optionally, according to the one or more second semantic regions, second semantic segmentation is performed on the ith first image corresponding to the ith first frequency map, to obtain a second semantic segmentation map corresponding to the ith first image, where i is an integer greater than or equal to 1 and less than or equal to N. As described above, N second semantic segmentation maps corresponding to the N first images can be obtained, and each of the N second semantic segmentation maps may include one or more first semantic regions and one or more second semantic regions. Optionally, some of the N second semantic segmentation maps may also include only one or more first semantic regions or only one or more second semantic regions, depending on the scene content contained in each first image.
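The comparison rule described above can be sketched as a simple mask computation; the threshold names follow the description ("first preset value", "second preset value"), while the concrete values and the pairing of low-/high-resolution frequency maps are illustrative assumptions.

import numpy as np

def weak_texture_mask(first_freq: np.ndarray, second_freq: np.ndarray,
                      first_preset: float, second_preset: float) -> np.ndarray:
    """True where the low-resolution image is weakly textured but the high-resolution
    counterpart carries much more frequency content (a candidate second semantic region)."""
    return (first_freq < first_preset) & ((second_freq - first_freq) > second_preset)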
For example, please refer to fig. 12, where fig. 12 is a schematic diagram illustrating a process of obtaining a second semantic segmentation map according to an embodiment of the present disclosure. As shown in fig. 12, the first semantic segmentation map obtained by performing the first semantic segmentation according to the preset semantic categories may include a first semantic region 4 (sky) and a first semantic region 5 (building). As shown in fig. 12, the difference between the image frequency of the roof portion of the first image and that of the roof portion of the second image is large: for example, almost no texture is visible in the roof portion of the first image, while a clearly visible tile texture exists in the roof portion of the second image. The roof portion is therefore a weak texture region in which texture is difficult to restore in super-resolution reconstruction, and is a region (i.e., the second semantic region) that easily causes pseudo-textures to be generated in other weak texture regions. A developer may use an existing semantic annotation tool (for example, a common image semantic segmentation annotation tool labelme) to frame and label the roof region, thereby completing the second semantic segmentation of the first image, and the obtained second semantic segmentation map may include a first semantic region 4 (sky), a first semantic region 5 (building), and a second semantic region 2 (roof part) as shown in fig. 12.
Step S15, training to obtain a semantic segmentation network by using the N first images and the N second semantic segmentation maps as a training sample set, where the N first images are the training input and the N second semantic segmentation maps are N labels.
Specifically, the terminal device obtains a first training sample set, where the first training sample set may include the N first images and the N second semantic segmentation maps. The N first images are used as the training input and the N second semantic segmentation maps are used as the N labels, so that an initial neural network can be trained, one or more parameters in the initial neural network are continuously corrected, and a semantic segmentation network for the target geographic area is finally obtained.
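The following compressed training-loop sketch illustrates this step under stated assumptions: PyTorch is used, the network architecture, optimizer, and hyper-parameters are placeholders, and a per-pixel cross-entropy loss stands in for whatever loss the actual implementation uses.

import torch
import torch.nn as nn

def train_segmentation_network(model, loader, epochs: int = 10, lr: float = 1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for first_image, second_seg_map in loader:   # (B,3,H,W), (B,H,W) long labels
            logits = model(first_image)              # (B, num_classes, H, W)
            loss = criterion(logits, second_seg_map)
            opt.zero_grad()
            loss.backward()
            opt.step()                               # corrects the network parameters
    return model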
It should be noted that the training of the semantic segmentation network may also be completed by a computing device or other devices except the terminal device, which is not specifically limited in this embodiment of the present application.
As described above, according to the embodiment of the application, the target image can be subjected to semantic segmentation according to preset semantic categories (for example, common semantic categories such as sky, buildings, people, plants, animals, water surfaces, roads, bridges, vehicles, traffic signals, and the like), and the semantic segmentation is synchronous, and the target image can be subjected to semantic segmentation according to weak texture areas where textures are difficult to recover in the geographic area, so that a finer semantic segmentation map is obtained, that is, finer, richer and comprehensive semantic information is obtained. Therefore, the quality of super-resolution reconstruction of finer, richer and comprehensive semantic information obtained based on the semantic segmentation can be greatly improved, the quality of the obtained super-resolution image is ensured, the super-resolution image has real and natural rich textures, and the photographing experience of a user is improved.
Step S803, inputting the target semantic segmentation map and the target image into a super-resolution network, performing first super-resolution processing on the target image according to the K first semantic regions, and performing second super-resolution processing on the target image according to the P second semantic regions to obtain a super-resolution image corresponding to the target image.
Specifically, after obtaining the target semantic segmentation map of the target image through the semantic segmentation network, the terminal device may input the target semantic segmentation map and the target image to the super-resolution network, perform first super-resolution processing on the target image according to the K first semantic regions, and perform second super-resolution processing on the target image according to the P second semantic regions, to obtain a super-resolution image corresponding to the target image. Alternatively, the resolution of the super-resolution image may be greater than or equal to the resolution of the target image. For example, as shown in fig. 9, the target image and the target semantic segmentation map may be input together into a super-resolution network, and the obtained super-resolution image has a resolution greater than that of the target image and has a real and natural rich texture as shown in fig. 9. Alternatively, the resolution of the super-resolution image may be equal to that of the target image, but the super-resolution image has richer texture than the target image, and the texture is real and natural.
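One plausible way to "input the target semantic segmentation map and the target image" together is to one-hot encode the label map and concatenate it with the image as extra input channels; the patent does not fix the fusion mechanism, so the PyTorch sketch below is only an assumption.

import torch
import torch.nn.functional as F

def super_resolve(sr_net, target_image, seg_labels, num_classes: int):
    """target_image: (1,3,H,W) float; seg_labels: (1,H,W) long per-pixel class indices."""
    one_hot = F.one_hot(seg_labels, num_classes).permute(0, 3, 1, 2).float()
    fused = torch.cat([target_image, one_hot], dim=1)   # semantic guidance as channels
    return sr_net(fused)                                 # super-resolution image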
Alternatively, as shown in fig. 9, a corresponding super-resolution network may be determined according to the geographic area information of the target image, for example, if the target geographic area to which the target image belongs is school a (i.e., the target image is an image captured by shooting a scene in school a), the corresponding super-resolution network for school a may be selected to perform super-resolution reconstruction on the target image; for another example, if the target geographic area to which the target image belongs is a B park (that is, the target image is an image acquired by shooting a scene in the B park), a corresponding super-resolution network for the B park may be selected to perform super-resolution reconstruction on the target image, and so on, which is not described herein again. Therefore, the collected target images of all the geographic areas can be subjected to the super-resolution reconstruction based on semantic segmentation based on the pre-trained super-resolution network aiming at the multiple different geographic areas, the high efficiency and accuracy of the super-resolution reconstruction are guaranteed, the quality of the super-resolution images is improved, and the photographing experience of a user is guaranteed.
Optionally, the performing of the first super-resolution processing on the target image according to the K first semantic regions may include: determining M first semantic regions in the K first semantic regions, and respectively adding textures to the corresponding M regions in the target image according to the preset semantic categories corresponding to the M first semantic regions, where each of the M first semantic regions is a region whose image frequency in the target image is greater than or equal to the first preset value; and determining Q first semantic regions in the K first semantic regions, and according to the Q first semantic regions, not adding textures to the corresponding Q regions in the target image, where each of the Q first semantic regions is a region whose image frequency in the target image is smaller than the first preset value. M and Q are each an integer greater than or equal to 0, and the sum of M and Q is K.
Optionally, the performing of the second super-resolution processing on the target image according to the P second semantic regions may include: and respectively adding textures to the corresponding P areas in the target image according to the P second semantic areas. Optionally, for the second semantic region, that is, a region where there are abundant textures in the actual scene but there are almost no textures in the target image (low-resolution image) (for example, a region such as a roof, where there may be only one color block in the target image but there are textures such as tiles in the actual scene), matching the corresponding textures through a super-resolution network obtained through pre-training, and performing corresponding texture addition on the plurality of second semantic regions respectively, thereby recovering the real textures in the actual scene.
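The per-region decisions in the two preceding paragraphs can be summarized by the sketch below; the callables freq_of_region and add_texture, and the mask-based blending, are stand-ins for the super-resolution network's internal, semantics-guided texture branch, which the patent does not spell out.

def region_wise_texture_policy(image, first_regions, second_regions,
                               freq_of_region, add_texture, first_preset):
    """first_regions / second_regions: {class_id: boolean mask}; returns the processed image."""
    out = image.copy()
    for class_id, mask in first_regions.items():
        if freq_of_region(image, mask) >= first_preset:      # one of the M regions
            out[mask] = add_texture(image, mask, class_id)   # class-specific texture
        # otherwise (one of the Q regions): no texture is added
    for class_id, mask in second_regions.items():            # the P weak-texture regions
        out[mask] = add_texture(image, mask, class_id)       # matched texture is restored
    return out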
For example, for the target image and the target semantic segmentation map shown in fig. 9 and 10, the target semantic segmentation map includes three first semantic regions and one second semantic region, specifically, a first semantic region 1 (sky), a first semantic region 2 (building), a first semantic region 3 (plant), and a second semantic region 1. The second semantic area 1 is a building in front of a visual field in the target image, the image frequency of the second semantic area 1 in the target image is less than a first preset value, and only includes low-frequency information (for example, as shown in fig. 9, the second semantic area 1 hardly has any texture related to the building in the target image, such as a tile of an exterior wall, but obviously, the second semantic area 1 has a texture such as a tile of an exterior wall in an actual scene), and the super-resolution network obtained through the pre-training can match the corresponding texture and perform corresponding texture addition, so as to recover the real texture in the actual scene (such as a tile of an exterior wall in the building). Optionally, image frequencies of the first semantic region 1 (sky), the first semantic region 2 (building), and the first semantic region 3 (plant) in the target image may be different, and for a first semantic region whose image frequency is greater than or equal to a first preset value, texture addition may be performed on a corresponding region in the target image according to a semantic category corresponding to the first semantic region. For example, if the image frequency of the first semantic area 2 (building) and the first semantic area 3 (plant) is greater than a first preset value and includes part of high-frequency information, adding corresponding textures belonging to the class of buildings to the area corresponding to the first semantic area 2 (building) in the target image based on guidance and constraint of the semantic information of the buildings; and corresponding textures belonging to plants are added to the region corresponding to the first semantic region 3 (plant) in the target image based on the guidance and constraint of the semantic information of the plant, so that the added textures are more natural and real, the added textures conform to the actual scene, and the quality of the super-resolution image is greatly improved. On the other hand, for example, if the image frequency of the first semantic region 1 (sky) in the target image is less than the first predetermined value and only includes low-frequency information (for example, as shown in fig. 9, the first semantic region 1 (sky) is a blue-blue and cloud-free sky region, which is a region where no texture exists in the original actual scene or almost no texture exists, such as a cloud, sunset, or other texture), no texture addition (or only a moderately weak texture addition) may be performed on the region corresponding to the first semantic region 1 (sky) in the target image during super-resolution reconstruction, so that an excessive texture that does not conform to the actual scene in the super-resolution image is avoided (i.e., generation of a pseudo texture is avoided).
As described above, according to the embodiment of the application, the target image can be subjected to semantic segmentation according to preset semantic categories (for example, common semantic categories such as sky, buildings, people, plants, animals, water surfaces, roads, bridges, vehicles, traffic signals, and the like), and the semantic segmentation is synchronous, and the target image can be subjected to semantic segmentation according to weak texture areas where textures are difficult to recover in the geographic area, so that a finer semantic segmentation map is obtained, that is, finer, richer and comprehensive semantic information is obtained. Therefore, the quality of subsequent super-resolution reconstruction based on semantic segmentation can be greatly improved, for example, corresponding textures can be added to different semantic regions (for example, corresponding textures belonging to plants can be added to a plant region based on the guidance and constraint of the semantics of the plants, corresponding textures belonging to buildings can be added to a building region based on the guidance and constraint of the semantics of the buildings, and the like), the real textures of weak texture regions can be recovered, and the textures can also be not added to regions without textures in an original actual scene (i.e., texture enhancement is not performed blindly, and it is ensured that the processed images are consistent with the actual scene), so that the quality of the obtained super-resolution images is ensured, the super-resolution images have real and natural rich textures, and the photographing experience of users is improved.
Alternatively, the texture addition in the super-resolution reconstruction described above may be implemented by a feature extraction module, an upsampling module, and the like. Optionally, the feature extraction module may be configured to extract high-dimensional information of the target image, and structures that are widely used for feature extraction at present include a convolutional network, a residual block (Res Block), a dense block (Dense Block), or a pairwise combination thereof. Alternatively, the upsampling module may use a conventional upsampling method, for example, bicubic interpolation (bicubic), bilinear interpolation (bilinear), or nearest neighbor interpolation (nearest), and may also use a deconvolution method, a pixel shuffle (pixel shuffle) method, and the like, which is not specifically limited in this embodiment of the present application.
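For illustration, a minimal residual block plus pixel-shuffle upsampler is sketched below in PyTorch; the channel counts, depth, and scale factor are assumptions and do not reflect the patent's actual network design.

import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))
    def forward(self, x):
        return x + self.body(x)          # residual connection

class PixelShuffleUpsampler(nn.Module):
    def __init__(self, channels: int = 64, scale: int = 2):
        super().__init__()
        self.up = nn.Sequential(
            nn.Conv2d(channels, channels * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale))      # rearranges channels into spatial resolution
    def forward(self, x):
        return self.up(x)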
Optionally, referring to fig. 13, fig. 13 is a schematic diagram of a training process of a super-resolution network according to an embodiment of the present application. The training process of the super-resolution network in the embodiment of the present application is described below with reference to steps S11-S15 of the training process of the semantic segmentation network and fig. 13. The important components in the super-resolution network training process mainly include a feature extraction module, an upsampling module, and a loss module. As shown in fig. 13, the N first images and the corresponding N second semantic segmentation maps may be input into the initial neural network, feature extraction is performed on the N first images based on the N second semantic segmentation maps, and upsampling is performed on the N first images after feature extraction, so as to obtain N third images corresponding to the N first images. Then, taking the N second images as N labels and comparing them with the N third images in a one-to-one correspondence, loss calculation is performed on the N third images based on the N second semantic segmentation maps, and one or more parameters in the initial neural network are corrected, finally obtaining a super-resolution network for the target geographic area. Alternatively, as shown in fig. 13, the types of loss calculation may include pixel loss (Pixel loss), adversarial loss (GAN loss), and feature loss (VGG loss). In general, different texture types (that is, different semantic categories) are sensitive to these losses to different degrees, and it is therefore often difficult to find a single global loss weighting that suits all categories. As shown in fig. 13, with the semantic information (that is, the second semantic segmentation maps) available, different loss ratios can be defined for each semantic category, and the global loss is calculated by counting the pixel ratios of the different semantic regions. This greatly improves the training efficiency of the super-resolution network, and the trained super-resolution network can perform more efficient and accurate super-resolution reconstruction on a low-resolution image based on the semantic information, ensuring that the obtained super-resolution image has rich textures that are real, natural, and consistent with the actual scene, avoiding the generation of redundant pseudo textures, and improving the photographing experience of the user.
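A minimal sketch of the semantic-weighted global loss described above is given below. It assumes PyTorch tensors, uses L1 as the pixel loss, and leaves the adversarial (GAN) and feature (VGG) losses as externally supplied callables; the per-class weights are illustrative placeholders rather than values fixed by this application.

import torch
import torch.nn.functional as F

def semantic_weighted_loss(sr, hr, seg, class_weights,
                           vgg_loss_fn=None, gan_loss_fn=None):
    """sr, hr: (B, C, H, W) super-resolved and ground-truth images.
    seg: (B, H, W) integer semantic segmentation map aligned with hr.
    class_weights: dict {class_id: loss_ratio} defined per semantic category.
    """
    per_pixel = F.l1_loss(sr, hr, reduction="none").mean(dim=1)  # (B, H, W)
    total, n_pixels = sr.new_tensor(0.0), seg.numel()
    for cls, w in class_weights.items():
        mask = (seg == cls)
        if mask.any():
            ratio = mask.sum().float() / n_pixels        # pixel ratio of this class
            total = total + w * ratio * per_pixel[mask].mean()
    if vgg_loss_fn is not None:
        total = total + vgg_loss_fn(sr, hr)              # feature (VGG) loss
    if gan_loss_fn is not None:
        total = total + gan_loss_fn(sr)                  # adversarial (GAN) loss
    return total

For example, class_weights = {0: 0.5, 1: 1.0, 2: 2.0} would penalize pixel errors in class 2 (say, a texture-rich plant class) twice as strongly as in class 1, reflecting the idea that different texture types are sensitive to the losses to different degrees.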
Referring to fig. 14, fig. 14 is a schematic structural diagram of an image processing apparatus based on semantic segmentation according to an embodiment of the present application, where the image processing apparatus based on semantic segmentation may include an apparatus 30, and the apparatus 30 may include a first obtaining unit 309, a first semantic segmentation unit 310, and a super-resolution unit 311, where details of each unit are described below.
A first acquisition unit 309 configured to acquire a target image;
a first semantic segmentation unit 310, configured to input the target image to a semantic segmentation network to obtain a target semantic segmentation map of the target image, where the target semantic segmentation map includes K first semantic regions and P second semantic regions; each first semantic area in the K first semantic areas is an area obtained by segmentation according to a preset semantic category; each second semantic area in the P second semantic areas is an area, with the image frequency being smaller than a first preset value, of the target image and used for texture addition in the super-resolution processing; K. p is an integer greater than or equal to 1;
a super-resolution unit 311, configured to input the target semantic segmentation map and the target image into a super-resolution network, perform first super-resolution processing on the target image according to the K first semantic regions, and perform second super-resolution processing on the target image according to the P second semantic regions, so as to obtain a super-resolution image corresponding to the target image; the resolution of the super-resolution image is greater than or equal to the resolution of the target image.
In one possible implementation, the preset semantic category includes one or more of sky, buildings, people, plants, animals, water, roads, bridges, vehicles, and traffic signals; the super-resolution unit 311 is specifically configured to: determining M first semantic regions in the K first semantic regions, and adding textures to the corresponding M regions in the target image according to the preset semantic categories corresponding to the M first semantic regions respectively; each of the M first semantic regions is a region in the target image, wherein the image frequency of the region is greater than or equal to the first preset value; and respectively adding textures to the corresponding P areas in the target image according to the P second semantic areas.
In a possible implementation manner, the super-resolution unit 311 is further specifically configured to: determining Q first semantic regions in the K first semantic regions, and according to the Q first semantic regions, not adding textures to the corresponding Q regions in the target image; each of the Q first semantic regions is a region of which the image frequency in the target image is smaller than the first preset value; m, Q is an integer greater than or equal to 0 and the sum of M and Q is K.
In one possible implementation, the apparatus further includes:
a second obtaining unit 301, configured to obtain a first image set and a second image set, where the first image set includes N first images, the second image set includes N second images, the N first images are in one-to-one correspondence with the N second images, and a resolution of each of the N second images is greater than a resolution of each of the N first images; n is an integer greater than or equal to 1;
a second semantic segmentation unit 302, configured to perform first semantic segmentation on the N first images according to the preset semantic category to obtain N first semantic segmentation maps corresponding to the N first images; each of the N first semantic segmentation maps comprises one or more first semantic regions;
a frequency analyzing unit 303, configured to perform frequency analysis on the N first images and the N second images to obtain N first frequency maps corresponding to the N first images and N second frequency maps corresponding to the N second images, respectively;
a third semantic segmentation unit 304, configured to perform second semantic segmentation on the N first images according to preset conditions based on the N first frequency maps and the N second frequency maps to obtain N second semantic segmentation maps corresponding to the N first images, where each of the N second semantic segmentation maps includes the one or more first semantic regions and the one or more second semantic regions.
In a possible implementation manner, the third semantic segmentation unit 304 is specifically configured to: comparing the N first frequency graphs with the N second frequency graphs in a one-to-one correspondence manner, and determining one or more regions, of which the difference values of the image frequency in the ith first frequency graph of the N first frequency graphs and the image frequency in the ith second frequency graph of the corresponding N second frequency graphs are greater than a second preset value, as one or more second semantic regions, wherein each region of the one or more regions is a region, of which the image frequency in the ith first frequency graph is less than the first preset value; performing second semantic segmentation on the ith first image corresponding to the ith first frequency map according to the one or more second semantic regions to obtain a second semantic segmentation map corresponding to the ith first image, wherein the second semantic segmentation map corresponding to the ith first image comprises the one or more first semantic regions and the one or more second semantic regions; i is an integer greater than or equal to 1 and less than or equal to N.
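For illustration, the frequency comparison performed by this unit could look like the sketch below, which uses the local mean of the Laplacian magnitude as a stand-in for the image frequency; the window size and both preset thresholds are hypothetical, and the first image is assumed to have been upsampled to the size of the second image so that the two frequency maps can be compared pixel by pixel.

import numpy as np
from scipy.ndimage import laplace, uniform_filter

def local_frequency_map(image: np.ndarray, window: int = 9) -> np.ndarray:
    # Local mean of Laplacian magnitude as a proxy for image frequency.
    return uniform_filter(np.abs(laplace(image.astype(np.float32))), size=window)

def second_semantic_mask(first_image_up, second_image, first_preset, second_preset):
    """Mark pixels that are low-frequency in the first (low-resolution) image and
    whose frequency difference to the second (high-resolution) image exceeds the
    second preset value, i.e. candidate second semantic (weak-texture) regions."""
    f_first = local_frequency_map(first_image_up)    # i-th first frequency map
    f_second = local_frequency_map(second_image)     # i-th second frequency map
    return (f_second - f_first > second_preset) & (f_first < first_preset)

The resulting boolean mask would then be merged with the first semantic segmentation map of the ith first image to form its second semantic segmentation map.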
In one possible implementation, the apparatus further includes:
a third obtaining unit 305, configured to obtain a first training sample set, where the first training sample set includes the N first images and the N second semantic segmentation maps;
the first training unit 306 is configured to train to obtain the semantic segmentation network by using the N first images and the N second semantic segmentation maps as training inputs and using the N second semantic segmentation maps as N labels.
In one possible implementation, the apparatus further includes:
a fourth obtaining unit 307, configured to obtain a second training sample set, where the second training sample set includes the N first images, the N second semantic segmentation maps, and the N second images;
the second training unit 308 is configured to train to obtain the super-resolution network by taking the N first images, the N second semantic segmentation maps, and the N second images as training inputs and taking the N second images as N labels.
In a possible implementation manner, the second training unit 308 is specifically configured to: inputting the N first images and the N second semantic segmentation maps into an initial neural network, extracting the features of the N first images based on the N second semantic segmentation maps, and performing upsampling processing on the N first images after feature extraction to obtain N third images corresponding to the N first images; and taking the N second images as N labels, performing loss calculation on the N third images based on the N second semantic segmentation graphs, and correcting one or more parameters in the initial neural network to obtain the super-resolution network.
In a possible implementation manner, the target image, the N first images, and the N second images are images obtained by shooting for a scene in a target geographic area; the semantic segmentation network is a semantic segmentation network aiming at the target geographic area, and the super-resolution network is a super-resolution network aiming at the target geographic area.
It should be noted that, for the functions of each functional unit in the image processing apparatus based on semantic segmentation described in the embodiment of the present application, reference may be made to the description of step S801 to step S803 in the embodiment of the method described in fig. 8, and details are not repeated here.
Each of the units in fig. 14 may be implemented in software, hardware, or a combination thereof. A unit implemented in hardware may include circuits, for example, arithmetic circuits, analog circuits, or the like. A unit implemented in software may comprise program instructions which, regarded as a software product, are stored in a memory and are executable by a processor to perform the relevant functions; reference may be made in particular to the foregoing description.
Based on the description of the method embodiment and the apparatus embodiment, the embodiment of the present application further provides a terminal device. Referring to fig. 15, fig. 15 is a schematic structural diagram of a terminal device according to an embodiment of the present application, where the terminal device includes at least a processor 401, an input device 402, an output device 403, and a computer-readable storage medium 404, and the terminal device may further include other general components, which are not described in detail herein. Wherein the processor 401, input device 402, output device 403 and computer readable storage medium 404 within the terminal device may be connected by a bus or other means.
The processor 401 may be a general purpose Central Processing Unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to control the execution of programs according to the above schemes.
The memory in the terminal device may be a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, a random access memory (RAM) or another type of dynamic storage device capable of storing information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory may be self-contained and coupled to the processor via a bus. The memory may also be integral to the processor.
A computer-readable storage medium 404 may be stored in the memory of the terminal device, said computer-readable storage medium 404 being adapted to store a computer program comprising program instructions, said processor 401 being adapted to execute the program instructions stored by said computer-readable storage medium 404. The processor 401 (or CPU) is a computing core and a control core of the terminal device, and is adapted to implement one or more instructions, and specifically, adapted to load and execute one or more instructions to implement corresponding method flows or corresponding functions; in one embodiment, the processor 401 according to the embodiment of the present application may be configured to perform a series of processes of image processing based on semantic segmentation, including: acquiring a target image; inputting the target image into a semantic segmentation network to obtain a target semantic segmentation map of the target image, wherein the target semantic segmentation map comprises K first semantic regions and P second semantic regions; each first semantic area in the K first semantic areas is an area obtained by segmentation according to a preset semantic category; each second semantic area in the P second semantic areas is an area, with the image frequency being smaller than a first preset value, of the target image and used for texture addition in the super-resolution processing; K. p is an integer greater than or equal to 1; inputting the target semantic segmentation map and the target image into a super-resolution network, performing first super-resolution processing on the target image according to the K first semantic regions, and performing second super-resolution processing on the target image according to the P second semantic regions to obtain a super-resolution image corresponding to the target image; the resolution of the super-resolution image is greater than or equal to the resolution of the target image, and so on.
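Purely as an illustration of that processing flow, an end-to-end call on the processor side might look like the following sketch; seg_net and sr_net stand for the pretrained semantic segmentation network and super-resolution network and are assumptions made for this example, not components defined by this application.

import torch

def super_resolve(target_image: torch.Tensor, seg_net, sr_net) -> torch.Tensor:
    # target_image: (1, 3, H, W) low-resolution input, values in [0, 1].
    with torch.no_grad():
        seg_map = seg_net(target_image)            # target semantic segmentation map
        sr_image = sr_net(target_image, seg_map)   # semantic-guided super-resolution
    return sr_image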
It should be noted that, for the functions of each functional unit in the terminal device described in the embodiment of the present application, reference may be made to the related description of step S801 to step S803 in the method embodiment described in fig. 8, which is not described herein again.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
An embodiment of the present application further provides a computer-readable storage medium (Memory), which is a Memory device in the terminal device and is used for storing programs and data. It is understood that the computer readable storage medium herein may include a built-in storage medium in the terminal device, and may also include an extended storage medium supported by the terminal device. The computer-readable storage medium provides a storage space that stores an operating system of the terminal device. Also, one or more instructions, which may be one or more computer programs (including program code), are stored in the memory space and are adapted to be loaded and executed by the processor 401. It should be noted that the computer-readable storage medium may be a high-speed RAM memory, or may be a non-volatile memory (non-volatile memory), such as at least one disk memory; and optionally at least one computer readable storage medium remotely located from the aforementioned processor.
Embodiments of the present application also provide a computer program, which includes instructions that, when executed by a computer, enable the computer to perform some or all of the steps of any one of the image processing methods based on semantic segmentation.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; for example, the division of the units is only one type of logical function division, and other division manners may be used in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling, direct coupling, or communication connection may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, and may specifically be a processor in the computer device) to execute all or part of the steps of the methods of the embodiments of the present application. The storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a read-only memory (ROM), or a random access memory (RAM).
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (21)

1. An image processing method based on semantic segmentation is characterized by comprising the following steps:
acquiring a target image;
inputting the target image into a semantic segmentation network to obtain a target semantic segmentation map of the target image, wherein the target semantic segmentation map comprises K first semantic regions and P second semantic regions; each first semantic area in the K first semantic areas is an area obtained by segmentation according to a preset semantic category; each second semantic area in the P second semantic areas is an area, with the image frequency being smaller than a first preset value, of the target image and used for texture addition in the super-resolution processing; K. p is an integer greater than or equal to 1;
inputting the target semantic segmentation map and the target image into a super-resolution network, performing first super-resolution processing on the target image according to the K first semantic regions, and performing second super-resolution processing on the target image according to the P second semantic regions to obtain a super-resolution image corresponding to the target image; the resolution of the super-resolution image is greater than or equal to the resolution of the target image.
2. The method of claim 1, wherein the preset semantic categories include one or more of sky, buildings, people, plants, animals, water, roads, bridges, vehicles, and traffic signals; the performing a first super-resolution process on the target image according to the K first semantic regions and performing a second super-resolution process on the target image according to the P second semantic regions includes:
determining M first semantic regions in the K first semantic regions, and adding textures to the corresponding M regions in the target image according to the preset semantic categories corresponding to the M first semantic regions respectively; each of the M first semantic regions is a region in the target image, wherein the image frequency of the region is greater than or equal to the first preset value;
and respectively adding textures to the corresponding P areas in the target image according to the P second semantic areas.
3. The method of claim 2, further comprising:
determining Q first semantic regions in the K first semantic regions, and according to the Q first semantic regions, not adding textures to the corresponding Q regions in the target image; each of the Q first semantic regions is a region of which the image frequency in the target image is smaller than the first preset value; m, Q is an integer greater than or equal to 0 and the sum of M and Q is K.
4. The method according to any one of claims 1-3, further comprising:
acquiring a first image set and a second image set, wherein the first image set comprises N first images, the second image set comprises N second images, the N first images are in one-to-one correspondence with the N second images, and the resolution of each second image in the N second images is greater than that of each corresponding first image in the N first images; n is an integer greater than or equal to 1;
performing first semantic segmentation on the N first images according to the preset semantic categories to obtain N first semantic segmentation images corresponding to the N first images; each of the N first semantic segmentation maps comprises one or more first semantic regions;
frequency analysis is carried out on the N first images and the N second images, and N first frequency graphs corresponding to the N first images and N second frequency graphs corresponding to the N second images are obtained respectively;
and performing second semantic segmentation on the N first images according to preset conditions based on the N first frequency graphs and the N second frequency graphs to obtain N second semantic segmentation graphs corresponding to the N first images, wherein each second semantic segmentation graph in the N second semantic segmentation graphs comprises one or more first semantic regions and one or more second semantic regions.
5. The method according to claim 4, wherein the performing second semantic segmentation on the N first images according to a preset condition based on the N first frequency maps and the N second frequency maps to obtain N second semantic segmentation maps corresponding to the N first images includes:
comparing the N first frequency graphs with the N second frequency graphs in a one-to-one correspondence manner, and determining one or more regions, of which the difference values of the image frequency in the ith first frequency graph of the N first frequency graphs and the image frequency in the ith second frequency graph of the corresponding N second frequency graphs are greater than a second preset value, as one or more second semantic regions, wherein each region of the one or more regions is a region, of which the image frequency in the ith first frequency graph is less than the first preset value;
performing second semantic segmentation on the ith first image corresponding to the ith first frequency map according to the one or more second semantic regions to obtain a second semantic segmentation map corresponding to the ith first image, wherein the second semantic segmentation map corresponding to the ith first image comprises the one or more first semantic regions and the one or more second semantic regions; i is an integer greater than or equal to 1 and less than or equal to N.
6. The method according to any one of claims 4-5, further comprising:
acquiring a first training sample set, wherein the first training sample set comprises the N first images and the N second semantic segmentation maps;
and training to obtain the semantic segmentation network by taking the N first images and the N second semantic segmentation maps as training input and taking the N second semantic segmentation maps as N labels.
7. The method according to any one of claims 4-6, further comprising:
acquiring a second training sample set, wherein the second training sample set comprises the N first images, the N second semantic segmentation graphs and the N second images;
and training to obtain the super-resolution network by taking the N first images, the N second semantic segmentation graphs and the N second images as training input and taking the N second images as N labels.
8. The method of claim 7, wherein the training with the N first images, the N second semantic segmentation maps, and the N second images as training inputs and the N second images as N labels to obtain the super-resolution network comprises:
inputting the N first images and the N second semantic segmentation maps into an initial neural network, extracting the features of the N first images based on the N second semantic segmentation maps, and performing upsampling processing on the N first images after feature extraction to obtain N third images corresponding to the N first images;
and taking the N second images as N labels, performing loss calculation on the N third images based on the N second semantic segmentation graphs, and correcting one or more parameters in the initial neural network to obtain the super-resolution network.
9. The method according to any one of claims 4-8, wherein the target image, the N first images and the N second images are images taken of a scene within a target geographic area; the semantic segmentation network is a semantic segmentation network aiming at the target geographic area, and the super-resolution network is a super-resolution network aiming at the target geographic area.
10. An image processing apparatus based on semantic segmentation, comprising:
the first acquisition module is used for acquiring a target image;
the first semantic segmentation module is used for inputting the target image into a semantic segmentation network to obtain a target semantic segmentation map of the target image, wherein the target semantic segmentation map comprises K first semantic regions and P second semantic regions; each first semantic area in the K first semantic areas is an area obtained by segmentation according to a preset semantic category; each second semantic area in the P second semantic areas is an area, with the image frequency being smaller than a first preset value, of the target image and used for texture addition in the super-resolution processing; K. p is an integer greater than or equal to 1;
the super-resolution module is used for inputting the target semantic segmentation map and the target image into a super-resolution network, performing first super-resolution processing on the target image according to the K first semantic regions, and performing second super-resolution processing on the target image according to the P second semantic regions to obtain a super-resolution image corresponding to the target image; the resolution of the super-resolution image is greater than or equal to the resolution of the target image.
11. The apparatus of claim 10, wherein the preset semantic categories include one or more of sky, buildings, people, plants, animals, water, roads, bridges, vehicles, and traffic signals; the super-resolution module is specifically used for:
determining M first semantic regions in the K first semantic regions, and adding textures to the corresponding M regions in the target image according to the preset semantic categories corresponding to the M first semantic regions respectively; each of the M first semantic regions is a region in the target image, wherein the image frequency of the region is greater than or equal to the first preset value;
and respectively adding textures to the corresponding P areas in the target image according to the P second semantic areas.
12. The apparatus of claim 11, wherein the super-resolution module is further specifically configured to:
determining Q first semantic regions in the K first semantic regions, and according to the Q first semantic regions, not adding textures to the corresponding Q regions in the target image; each of the Q first semantic regions is a region of which the image frequency in the target image is smaller than the first preset value; m, Q is an integer greater than or equal to 0 and the sum of M and Q is K.
13. The apparatus of any one of claims 10-12, further comprising:
a second obtaining module, configured to obtain a first image set and a second image set, where the first image set includes N first images, the second image set includes N second images, the N first images are in one-to-one correspondence with the N second images, and a resolution of each of the N second images is greater than a resolution of each of the N corresponding first images; n is an integer greater than or equal to 1;
the second semantic segmentation module is used for performing first semantic segmentation on the N first images according to the preset semantic categories to obtain N first semantic segmentation maps corresponding to the N first images; each of the N first semantic segmentation maps comprises one or more first semantic regions;
a frequency analysis module, configured to perform frequency analysis on the N first images and the N second images to obtain N first frequency graphs corresponding to the N first images and N second frequency graphs corresponding to the N second images, respectively;
and the third semantic segmentation module is used for performing second semantic segmentation on the N first images according to preset conditions based on the N first frequency graphs and the N second frequency graphs to obtain N second semantic segmentation graphs corresponding to the N first images, wherein each second semantic segmentation graph in the N second semantic segmentation graphs comprises the one or more first semantic regions and the one or more second semantic regions.
14. The apparatus of claim 13, wherein the third semantic segmentation module is specifically configured to:
comparing the N first frequency graphs with the N second frequency graphs in a one-to-one correspondence manner, and determining one or more regions, of which the difference values of the image frequency in the ith first frequency graph of the N first frequency graphs and the image frequency in the ith second frequency graph of the corresponding N second frequency graphs are greater than a second preset value, as one or more second semantic regions, wherein each region of the one or more regions is a region, of which the image frequency in the ith first frequency graph is less than the first preset value;
performing second semantic segmentation on the ith first image corresponding to the ith first frequency map according to the one or more second semantic regions to obtain a second semantic segmentation map corresponding to the ith first image, wherein the second semantic segmentation map corresponding to the ith first image comprises the one or more first semantic regions and the one or more second semantic regions; i is an integer greater than or equal to 1 and less than or equal to N.
15. The apparatus of any one of claims 13-14, further comprising:
a third obtaining module, configured to obtain a first training sample set, where the first training sample set includes the N first images and the N second semantic segmentation maps;
and the first training module is used for training to obtain the semantic segmentation network by taking the N first images and the N second semantic segmentation maps as training input and taking the N second semantic segmentation maps as N labels.
16. The apparatus of any one of claims 13-15, further comprising:
a fourth obtaining module, configured to obtain a second training sample set, where the second training sample set includes the N first images, the N second semantic segmentation maps, and the N second images;
and the second training module is used for training to obtain the super-resolution network by taking the N first images, the N second semantic segmentation graphs and the N second images as training input and taking the N second images as N labels.
17. The apparatus of claim 16, wherein the second training module is specifically configured to:
inputting the N first images and the N second semantic segmentation maps into an initial neural network, extracting the features of the N first images based on the N second semantic segmentation maps, and performing upsampling processing on the N first images after feature extraction to obtain N third images corresponding to the N first images;
and taking the N second images as N labels, performing loss calculation on the N third images based on the N second semantic segmentation graphs, and correcting one or more parameters in the initial neural network to obtain the super-resolution network.
18. The apparatus according to any one of claims 13-17, wherein the target image, the N first images and the N second images are images captured for a scene within a target geographic area; the semantic segmentation network is a semantic segmentation network aiming at the target geographic area, and the super-resolution network is a super-resolution network aiming at the target geographic area.
19. A terminal device comprising a processor and a memory, the processor being coupled to the memory, wherein the memory is configured to store program code and the processor is configured to invoke the program code to perform the method of any of claims 1 to 9.
20. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the method of any of the preceding claims 1 to 9.
21. A computer program, characterized in that the computer program comprises instructions which, when executed by a computer, cause the computer to carry out the method according to any one of claims 1 to 9.
CN202010313277.5A 2020-04-20 Image processing method based on semantic segmentation and related equipment Active CN113538227B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010313277.5A CN113538227B (en) 2020-04-20 Image processing method based on semantic segmentation and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010313277.5A CN113538227B (en) 2020-04-20 Image processing method based on semantic segmentation and related equipment

Publications (2)

Publication Number Publication Date
CN113538227A true CN113538227A (en) 2021-10-22
CN113538227B CN113538227B (en) 2024-04-12



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108335306A (en) * 2018-02-28 2018-07-27 北京市商汤科技开发有限公司 Image processing method and device, electronic equipment and storage medium
US20190295302A1 (en) * 2018-03-22 2019-09-26 Northeastern University Segmentation Guided Image Generation With Adversarial Networks
CN109191392A (en) * 2018-08-09 2019-01-11 复旦大学 A kind of image super-resolution reconstructing method of semantic segmentation driving

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116029951A (en) * 2022-05-27 2023-04-28 荣耀终端有限公司 Image processing method and electronic equipment
CN115147628A (en) * 2022-09-06 2022-10-04 深圳市明源云科技有限公司 House image data processing method and device, terminal equipment and medium
CN115147628B (en) * 2022-09-06 2022-12-02 深圳市明源云科技有限公司 House image data processing method and device, terminal equipment and medium
CN115601688A (en) * 2022-12-15 2023-01-13 中译文娱科技(青岛)有限公司(Cn) Video main content detection method and system based on deep learning

Similar Documents

Publication Publication Date Title
US11949978B2 (en) Image content removal method and related apparatus
WO2021078001A1 (en) Image enhancement method and apparatus
CN111738122B (en) Image processing method and related device
CN113706414B (en) Training method of video optimization model and electronic equipment
WO2023284715A1 (en) Object reconstruction method and related device
CN115061770B (en) Method and electronic device for displaying dynamic wallpaper
WO2023093169A1 (en) Photographing method and electronic device
CN113536866A (en) Character tracking display method and electronic equipment
WO2022156473A1 (en) Video playing method and electronic device
CN116916151B (en) Shooting method, electronic device and storage medium
WO2021180046A1 (en) Image color retention method and device
CN114979457B (en) Image processing method and related device
CN113099146B (en) Video generation method and device and related equipment
CN113538227B (en) Image processing method based on semantic segmentation and related equipment
CN115115679A (en) Image registration method and related equipment
CN113538227A (en) Image processing method based on semantic segmentation and related equipment
CN114793283A (en) Image encoding method, image decoding method, terminal device, and readable storage medium
CN116453131B (en) Document image correction method, electronic device and storage medium
CN116152123B (en) Image processing method, electronic device, and readable storage medium
CN116757963B (en) Image processing method, electronic device, chip system and readable storage medium
WO2023216957A1 (en) Target positioning method and system, and electronic device
US20240046504A1 (en) Image processing method and electronic device
CN116193275B (en) Video processing method and related equipment
CN116828099B (en) Shooting method, medium and electronic equipment
CN114697525B (en) Method for determining tracking target and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant