CN111968064B - Image processing method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN111968064B
CN111968064B CN202011136482.5A CN202011136482A CN111968064B CN 111968064 B CN111968064 B CN 111968064B CN 202011136482 A CN202011136482 A CN 202011136482A CN 111968064 B CN111968064 B CN 111968064B
Authority
CN
China
Prior art keywords
feature
image
features
super
target detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011136482.5A
Other languages
Chinese (zh)
Other versions
CN111968064A (en)
Inventor
王树
罗振波
吉翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Ruiyan Technology Co ltd
Original Assignee
Chengdu Ruiyan Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Ruiyan Technology Co ltd filed Critical Chengdu Ruiyan Technology Co ltd
Priority to CN202011136482.5A
Publication of CN111968064A
Application granted
Publication of CN111968064B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4053 Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The application provides an image processing method and apparatus, an electronic device, and a storage medium, wherein the method comprises the following steps: obtaining an image to be processed; performing feature extraction on the image to be processed to obtain image features; extracting a first feature from the image features by using a super-resolution model, and extracting a second feature from the image features by using a target detection network model; and fusing the first feature and the second feature at the semantic and detail levels to obtain a fused feature, and performing computer vision processing on the fused feature. In this way, the super-resolution feature and the target detection feature extracted from the image features are fused, and computer vision processing is then performed on the resulting fused feature, so that the complementarity of the super-resolution feature and the target detection feature in semantic information and detail information is fully exploited for computer-vision tasks such as super-resolution reconstruction or target detection, effectively improving the effect of computer vision processing on the image.

Description

Image processing method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer vision, target detection and super-resolution technologies, and in particular, to an image processing method and apparatus, an electronic device, and a storage medium.
Background
Computer vision is the science of how to make machines "see": cameras and computers are used in place of human eyes to recognize, track and measure targets, and the resulting images are further processed by the computer into images better suited for human observation or for transmission to instruments for detection. Computer-vision tasks include super-resolution reconstruction, target detection, and the like.
In practice, it has been found that image acquisition is often subject to various limitations of the real environment, for example: an image captured with an ordinary-resolution camera has a low resolution, or the target object in the captured image is small. At present, computer vision processing of such limited images performs poorly. For example, super-resolution reconstruction of a low-resolution image tends to introduce detail-texture errors; likewise, target detection on an image in which the target object is small is prone to recognition failures or recognition errors.
Disclosure of Invention
An object of the embodiments of the present application is to provide an image processing method, an image processing apparatus, an electronic device, and a storage medium, which are used to solve the problem that computer vision processing of images with such limitations is not effective.
The embodiment of the application provides an image processing method, which comprises the following steps: obtaining an image to be processed; performing low-level feature extraction on the image to be processed to obtain image features; extracting a first feature from the image features by using a super-resolution model, and extracting a second feature from the image features by using a target detection network model; and fusing the first feature and the second feature at the semantic and detail levels to obtain a fused feature, and performing computer vision processing on the fused feature. In this implementation, the super-resolution feature and the target detection feature extracted from the task-oriented image features are fused, and computer vision processing is then performed on the resulting fused feature, so that the complementarity of the super-resolution feature and the target detection feature in semantic information and detail information is fully exploited for computer-vision tasks such as super-resolution reconstruction or target detection, effectively improving the effect of computer vision processing on the image.
Optionally, in an embodiment of the present application, the computer vision processing includes: super-resolution reconstruction; fusing the first feature and the second feature to obtain a fused feature, and performing computer vision processing on the fused feature, includes: upsampling the second feature to obtain an upsampled feature; performing feature transformation on the upsampled feature by using a first convolution module to obtain a semantic feature; fusing the first feature and the semantic feature to obtain a fused feature; and performing super-resolution reconstruction on the fused feature to obtain a super-resolution image. In this implementation, the first feature related to the super-resolution task is fused with the semantic feature obtained by transforming the upsampled feature, and super-resolution reconstruction is performed on the resulting fused feature to obtain a super-resolution image; the detail information in the first feature and the semantic information in the transformed semantic feature are thereby used effectively to complete the super-resolution reconstruction task, improving the accuracy of super-resolution reconstruction of the image to be processed.
Optionally, in this embodiment of the present application, fusing the first feature and the semantic feature includes: performing channel attention fusion on the first feature and the semantic feature, or performing spatial attention fusion on the first feature and the semantic feature.
Optionally, in this embodiment of the present application, after obtaining the super-resolution image, the method further includes: and carrying out target detection processing on the super-resolution image to obtain a target detection result. In the implementation process, the first feature related to the super-resolution task and the semantic feature obtained after the up-sampling feature is transformed are fused to obtain a fusion feature; performing super-resolution reconstruction on the fusion features to obtain a super-resolution image; finally, performing target detection processing on the super-resolution image to obtain a target detection result; the definition of the super-resolution image is improved by effectively utilizing the fused semantic features, so that the accuracy of target detection is improved.
Optionally, in an embodiment of the present application, the computer vision processing includes: target detection processing; fusing the first feature and the second feature at the semantic and detail levels to obtain a fused feature, and performing computer vision processing on the fused feature, includes: performing feature transformation on the first feature by using a second convolution module to obtain detail features; fusing the detail features and the second feature to obtain a fused feature; and performing target detection processing on the fused feature to obtain a target detection result. In this implementation, the detail features obtained by transforming the first feature are fused with the second feature related to the target detection task to obtain a fused feature, and target detection processing is then performed on the fused feature to obtain a target detection result; the fused detail features are thereby used effectively for target detection, improving the accuracy of target detection.
Optionally, in an embodiment of the present application, the fusing the detail feature and the second feature includes: and performing channel fusion on the detail characteristic and the second characteristic, or performing spatial attention fusion on the detail characteristic and the second characteristic.
Optionally, in this embodiment of the present application, the target detection network model includes: a frame regression branch and a classification branch; and performing target detection processing on the fused feature includes: predicting all candidate frames in the fused feature by using the frame regression branch, wherein each candidate frame represents a predicted region containing a target object in the image to be processed; and classifying the candidate frames by using the classification branch to obtain the target class of the target object corresponding to each candidate frame and the probability corresponding to the target class.
An embodiment of the present application further provides an image processing apparatus, including: a processed-image obtaining module, configured to obtain an image to be processed; an image feature obtaining module, configured to perform low-level feature extraction on the image to be processed to obtain image features; an image feature extraction module, configured to extract a first feature from the image features by using a super-resolution model and extract a second feature from the image features by using a target detection network model; and a fusion vision processing module, configured to fuse the first feature and the second feature at the semantic and detail levels to obtain a fused feature and perform computer vision processing on the fused feature. In this implementation, the super-resolution feature and the target detection feature extracted from the image features are fused, and computer vision processing is then performed on the resulting fused feature, so that the complementarity of the two features in semantic information and detail information is fully exploited for computer-vision tasks such as super-resolution reconstruction or target detection, effectively improving the effect of computer vision processing on the image.
Optionally, in an embodiment of the present application, the computer vision processing includes: super-resolution reconstruction; the fusion vision processing module includes: a sampling feature obtaining module, configured to upsample the second feature to obtain an upsampled feature; a semantic feature obtaining module, configured to perform feature transformation on the upsampled feature by using the first convolution module to obtain a semantic feature; a first feature fusion module, configured to fuse the first feature and the semantic feature to obtain a fused feature; and a super-resolution reconstruction module, configured to perform super-resolution reconstruction on the fused feature to obtain a super-resolution image.
Optionally, in an embodiment of the present application, the first feature fusion module includes: a semantic feature fusion module, configured to perform channel attention fusion on the first feature and the semantic feature, or perform spatial attention fusion on the first feature and the semantic feature.
Optionally, in an embodiment of the present application, the fusion vision processing module further includes: and the first target detection module is used for carrying out target detection processing on the super-resolution image to obtain a target detection result.
Optionally, in an embodiment of the present application, the computer vision processing includes: target detection processing; a fusion vision processing module comprising: the detail feature obtaining module is used for performing feature transformation on the first features by using the second convolution module to obtain detail features; the second feature fusion module is used for fusing the detail features and the second features to obtain fusion features; and the second target detection module is used for carrying out target detection processing on the fusion characteristics to obtain a target detection result.
Optionally, in an embodiment of the present application, the second feature fusion module includes: and the detail feature fusion module is used for performing channel fusion on the detail features and the second features or performing spatial attention fusion on the detail features and the second features.
Optionally, in this embodiment of the present application, the target detection network model includes: a frame regression branch and a classification branch; the second target detection module includes: a candidate frame prediction module, configured to predict all candidate frames in the fused feature by using the frame regression branch, wherein each candidate frame represents a predicted region containing a target object in the image to be processed; and a candidate frame classification module, configured to classify the candidate frames by using the classification branch to obtain the target classes of the target objects corresponding to the candidate frames and the probabilities corresponding to the target classes.
An embodiment of the present application further provides an electronic device, including: a processor and a memory, the memory storing processor-executable machine-readable instructions, the machine-readable instructions when executed by the processor performing the method as described above.
Embodiments of the present application also provide a storage medium having a computer program stored thereon, where the computer program is executed by a processor to perform the method as described above.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the embodiments are briefly described below. It should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can obtain other related drawings from these drawings without inventive effort.
Fig. 1 is a schematic flow chart of an image processing method provided in an embodiment of the present application;
FIG. 2 is a process diagram of computer vision processing provided by an embodiment of the present application;
fig. 3 is a network structure diagram of a residual-in-residual dense block according to an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating a super-resolution reconstruction process provided by an embodiment of the present application;
FIG. 5 is a schematic process diagram of the target detection process provided in the embodiment of the present application;
fig. 6 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The technical solution in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
Before describing the image processing method provided by the embodiment of the present application, some concepts related to the embodiment of the present application are described:
Super-Resolution, also called super-resolution reconstruction, refers to the process of increasing the resolution of an original image by hardware or software methods; obtaining a high-resolution image from a series of low-resolution images is super-resolution reconstruction. Common super-resolution reconstruction methods include methods based on convolutional neural networks, methods based on generative adversarial networks, and the like.
Target detection, also called target extraction, is an image understanding algorithm based on target geometry and statistical characteristics, the target detection combines positioning and identification of targets, and the accuracy and real-time performance of target detection are important capabilities in computer vision.
Residual Network (ResNet), also known as residual neural network, refers to a convolutional neural network proposed by researchers at Microsoft Research; the residual blocks inside ResNet use skip connections, which alleviate the vanishing-gradient problem caused by increasing depth in deep neural networks.
A Recurrent Neural Network (RNN) is a type of neural network that takes sequence data as input, recurses along the evolution direction of the sequence, and connects all recurrent units in a chain.
The super-resolution model refers to a neural network model for performing super-resolution reconstruction on an image, and can be understood as a neural network model for converting a low-resolution image into a high-resolution image.
The target detection network is a neural network for detecting a target object in an image, that is, the target object in the image is detected, and a position range, a classification and a probability of the target object in the image are given, the position range can be specifically labeled in the form of a detection frame, the classification refers to a specific class of the target object, and the probability refers to a probability that the target object in the detection frame is in the specific class.
A Region-based Convolutional Neural Network (RCNN) is the first algorithm to successfully apply deep learning to target detection; RCNN achieves target detection based on algorithms such as convolutional neural networks, linear regression, and support vector machines.
It should be noted that the image processing method provided in the embodiments of the present application may be executed by an electronic device, where the electronic device refers to a device terminal or a server having a function of executing a computer program, and the device terminal includes, for example: a smart phone, a Personal Computer (PC), a tablet computer, a Personal Digital Assistant (PDA), a Mobile Internet Device (MID), a network switch or a network router, and the like.
Before describing the image processing method provided by the embodiments of the present application, an application scenario to which the image processing method applies is described. The application scenario includes, but is not limited to: using the image processing method to perform computer vision processing on an image to be processed, where the computer vision processing includes, but is not limited to: super-resolution reconstruction, target detection, target identification, and the like; the target detection here performs target recognition on segmented sub-images after the image to be processed is segmented, and when the image to be processed already meets the requirements of target recognition, the image to be processed need not be segmented and target recognition can be performed on it directly.
The image processing method is used in specific application scenarios, for example: in the smart security scene, the collection of pictures is often influenced by many factors, such as: camera resolution, imaging quality, target distance, complex illumination weather and the like, which often result in too low resolution of the captured image (e.g., pedestrian, motor vehicle, license plate, etc.), or too small size of the target object in the captured image, making it difficult to detect the specific category of the target object in the image through the target detection neural network. The image processing method can be used for carrying out target detection on the collected image, the super-resolution characteristics are fused during target detection, and the specific category of the target object in the image is determined together, so that the accuracy of the target detection on the image to be processed is improved, or the image processing method can be used for carrying out super-resolution reconstruction on the collected image to obtain a high-definition image, and then carrying out target detection or target identification on the high-definition image to obtain the specific category of the target object in the image.
Please refer to fig. 1, which illustrates a flowchart of an image processing method provided in an embodiment of the present application; the main idea of the image processing method is that the super-resolution feature extracted from the image feature of the image to be processed and the target detection feature are fused, and then computer vision processing is performed according to the obtained fusion feature, so that computer vision related tasks such as super-resolution reconstruction or target detection are performed by fully utilizing the complementarity of detail information in the super-resolution feature and semantic information in the target detection feature, and the effect of computer vision processing on the image is effectively improved, and the method comprises the following steps:
step S110: and obtaining an image to be processed.
The obtaining method of the to-be-processed image in the step S110 includes, but is not limited to, the following: the first acquisition mode is that a target object is shot by using terminal equipment such as a video camera, a video recorder or a color camera, and an image to be processed is acquired; then the terminal device sends the image to be processed to the electronic device, then the electronic device receives the image to be processed sent by the terminal device, and the electronic device can store the image to be processed into a file system, a database or a mobile storage device; the second obtaining mode is to obtain a pre-stored image to be processed, and specifically includes: acquiring an image to be processed from a file system, or acquiring the image to be processed from a database, or acquiring the image to be processed from a mobile storage device; in the third obtaining mode, software such as a browser is used for obtaining the to-be-processed image on the internet, or other application programs are used for accessing the internet to obtain the to-be-processed image.
After step S110, step S120 is performed: and extracting bottom layer features of the image to be processed to obtain the bottom layer features of the image.
The underlying image features are the basic features extracted from an image, including but not limited to: texture, shading, color, brightness, channels, contrast, and the like.
Please refer to fig. 2, which is a schematic process diagram of the computer vision processing provided by the embodiment of the present application; rectangles in the figure represent names of data in the calculation process, ellipses in the figure represent specific names of calculation processes in the calculation process, and arrow directions in the figure represent data flow in the calculation process; there are many embodiments of the above step S120, including but not limited to the following:
in the first implementation mode, a residual network module is used to extract features of the image to be processed, obtaining underlying image features such as texture, color and shape; the residual network module here includes, but is not limited to: ResNet22, ResNet38, ResNet50, ResNet101, ResNet152, and the like, where the residual module may specifically include two convolutional layers with a stride of 2, thereby down-sampling the image to be processed to a quarter of its resolution.
In a second embodiment, a Convolutional Neural Network (CNN) model is used to extract image features in an image to be processed, where the Convolutional Neural network model includes but is not limited to: LeNet network model, AlexNet network model, VGG network model, GoogLeNet network model, and so on.
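For illustration only, the following sketch shows what such a feature-extraction stem could look like, assuming a PyTorch implementation with two stride-2 convolutions as in the first implementation mode above; the module name, channel widths and activation are illustrative assumptions rather than details taken from the patent.

    import torch
    import torch.nn as nn

    class FeatureStem(nn.Module):
        """Extracts underlying image features while reducing resolution to 1/4 (illustrative sketch)."""
        def __init__(self, in_channels=3, mid_channels=64, out_channels=64):
            super().__init__()
            # Two stride-2 convolutions: each halves the spatial size, so the output
            # feature map is a quarter of the input resolution.
            self.conv1 = nn.Conv2d(in_channels, mid_channels, kernel_size=3, stride=2, padding=1)
            self.conv2 = nn.Conv2d(mid_channels, out_channels, kernel_size=3, stride=2, padding=1)
            self.act = nn.ReLU(inplace=True)

        def forward(self, x):
            x = self.act(self.conv1(x))
            return self.act(self.conv2(x))

    if __name__ == "__main__":
        img = torch.randn(1, 3, 256, 256)   # image to be processed
        feats = FeatureStem()(img)
        print(feats.shape)                   # torch.Size([1, 64, 64, 64]), i.e. 1/4 resolution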
After step S120, step S130 is performed: a first feature of the image features is extracted using a super-resolution model, and a second feature of the image features is extracted using a target detection network model.
The role of the super-resolution model differs from that of the target detection network model. From the viewpoint of task guidance, the task of the super-resolution model is to perform super-resolution reconstruction on the image features, i.e. to make the image to be processed clearer, whereas the task of the target detection network model is to detect the position region and specific category of the target object in the image to be processed; thus, although both models process features from the same image source, their effects on the features are quite different. There are many embodiments of extracting the first feature from the image features by using the super-resolution model in step S130, including but not limited to the following:
in a first implementation manner, please refer to fig. 3, which illustrates a network structure diagram of a residual-in-residual dense block provided in an embodiment of the present application; the super-resolution model may adopt a Residual-in-Residual Dense Block (RRDB) to extract the first feature related to the super-resolution task from the image features. The RRDB may include a first Dense Block, a second Dense Block and a third Dense Block; the three dense blocks share the same network structure, only one of which is shown in the figure; the β value in the figure denotes the residual scaling parameter, and the plus (+) nodes in the figure denote fusion operations. The network structure of a dense block includes convolution modules (Conv) and Rectified Linear Units (ReLU), and the ReLU may be replaced by a Leaky Rectified Linear Unit (LReLU). In a specific implementation, different numbers of RRDB modules may be used according to different requirements on computation and performance; specifically, the number of RRDB modules may be any number between 16 and 24.
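A compact, non-authoritative sketch of a residual-in-residual dense block of this kind is given below, loosely following the ESRGAN-style structure (five convolutions per dense block, LReLU activations); the channel counts and the residual scaling value β = 0.2 are common defaults assumed for illustration, not figures from the patent.

    import torch
    import torch.nn as nn

    class DenseBlock(nn.Module):
        """Dense module: every convolution sees the concatenation of all earlier outputs."""
        def __init__(self, channels=64, growth=32, beta=0.2):
            super().__init__()
            self.beta = beta  # residual scaling parameter
            self.convs = nn.ModuleList(
                nn.Conv2d(channels + i * growth, growth if i < 4 else channels, 3, 1, 1)
                for i in range(5)
            )
            self.lrelu = nn.LeakyReLU(0.2, inplace=True)

        def forward(self, x):
            feats = [x]
            out = x
            for i, conv in enumerate(self.convs):
                out = conv(torch.cat(feats, dim=1))
                if i < 4:
                    out = self.lrelu(out)
                    feats.append(out)
            return x + self.beta * out  # scaled residual connection

    class RRDB(nn.Module):
        """Residual-in-residual dense block: three dense modules plus an outer residual."""
        def __init__(self, channels=64, beta=0.2):
            super().__init__()
            self.blocks = nn.Sequential(DenseBlock(channels), DenseBlock(channels), DenseBlock(channels))
            self.beta = beta

        def forward(self, x):
            return x + self.beta * self.blocks(x)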
In a second embodiment, the super-resolution model may adopt a Generative Adversarial Network (GAN); a GAN is a learning model in machine learning in which two neural networks compete with each other. A generative adversarial network consists of a generator and a discriminator, the generator taking random samples from the latent space as input data. Generative adversarial networks that may be used include: GAN, StyleGAN2, WGAN (Wasserstein GAN), WGAN-GP (Wasserstein GAN with Gradient Penalty), and the like.
In a third embodiment, the super-resolution model may include an up-sampling network, which up-samples the image to be processed to obtain an up-sampled image, extracts depth features from the up-sampled image, and fuses the depth features with the image features of the up-sampled image to obtain the first feature related to the super-resolution task; the up-sampled image may also be obtained by an interpolation algorithm, such as a nearest-neighbor interpolation algorithm, a bilinear interpolation algorithm, or a bicubic interpolation algorithm.
There are many embodiments of extracting the second feature from the image features by using the object detection network model in step S130, including but not limited to the following:
in the first embodiment, the target detection network model includes a plurality of residual network modules, so the plurality of residual network modules in the target detection network model may be used to extract the second feature from the image features. For example, the residual network modules may be three residual network modules with a stride of 2, so that the image features extracted from the image to be processed are down-sampled and the down-sampled feature map is 1/32 of the image feature size. When different numbers of residual network modules or residual network modules with different strides are used, the down-sampling ratio differs and can be set according to the specific situation; for example, three down-sampled feature maps of sizes 1/8, 1/16 and 1/32 may be output.
In a second implementation manner, a plurality of residual error network modules in the target detection network model can also be directly copied to obtain three residual error network modules with the step length of 2, and the copied three residual error network modules with the step length of 2 are used for extracting a second feature in the image feature; a plurality of second features with different sizes can be extracted according to specific practical situations, for example: the number and the step size of the residual error network modules are set according to the practical situation, so that 3 types of down-sampled feature maps with the sizes of 1/8, 1/16 and 1/32 are output.
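As a rough illustration of this multi-scale extraction, the sketch below stacks three stride-2 residual stages on the 1/4-resolution image features to produce 1/8, 1/16 and 1/32 feature maps; the stage names and channel widths are assumptions made for the example and do not come from the patent.

    import torch
    import torch.nn as nn

    class ResidualStage(nn.Module):
        """One stride-2 residual stage that halves the spatial resolution."""
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1),
            )
            self.skip = nn.Conv2d(in_ch, out_ch, 1, stride=2)  # projection shortcut

        def forward(self, x):
            return torch.relu(self.body(x) + self.skip(x))

    class DetectionBackbone(nn.Module):
        """Turns 1/4-resolution image features into 1/8, 1/16 and 1/32 detection features."""
        def __init__(self, in_ch=64):
            super().__init__()
            self.stage8 = ResidualStage(in_ch, 128)
            self.stage16 = ResidualStage(128, 256)
            self.stage32 = ResidualStage(256, 512)

        def forward(self, feats):
            p8 = self.stage8(feats)    # 1/8 of the original image size
            p16 = self.stage16(p8)     # 1/16
            p32 = self.stage32(p16)    # 1/32
            return p8, p16, p32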
After step S130, step S140 is performed: and fusing the first feature and the second feature to obtain a fused feature, and performing computer vision processing on the fused feature.
Among these, computer vision processing tasks include, but are not limited to: a super-resolution reconstruction task and a target detection processing task; the above feature fusion method is different for different computer vision processing tasks, and the different fusion methods for the two tasks are further described below, and the embodiments of the computer vision processing in step S140 include, but are not limited to, the following:
in a first embodiment, please refer to fig. 4, which shows a schematic process diagram of super-resolution reconstruction provided in the embodiment of the present application; shapes not explained in the figure have the same meanings as in fig. 2, the dashed rectangles indicate optional processing results, the dashed ovals indicate optional processing steps, and the dashed arrows indicate optional data flows. After the second feature is processed, the processed feature and the first feature are fused at the semantic information and detail information levels to obtain a fused feature, and super-resolution reconstruction is finally performed on the fused feature. This implementation may include:
step S141: and upsampling the second characteristic to obtain an upsampled characteristic.
The embodiment of step S141 described above includes, for example: up-sampling the second feature by using an interpolation algorithm or an up-sampling network model to obtain the up-sampled feature. Interpolation algorithms include, but are not limited to: nearest-neighbor interpolation, bilinear interpolation and bicubic interpolation; up-sampling network models include, but are not limited to: Convolutional Neural Network (CNN) models, deconvolution networks, PixelShuffle networks, and the like. A specific example: assume that the second feature includes feature maps of sizes 1/8, 1/16 and 1/32, whose semantic information increases progressively; the detection features may be scaled to the same size as the first feature associated with the super-resolution task by an up-sampling operation.
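A minimal sketch of this up-sampling step is shown below, assuming bilinear interpolation and illustrative tensor sizes; the actual feature sizes and channel counts depend on the networks used.

    import torch
    import torch.nn.functional as F

    def upsample_to(features, target_hw):
        """Resize every detection feature map to the spatial size of the super-resolution feature."""
        return [F.interpolate(f, size=target_hw, mode="bilinear", align_corners=False) for f in features]

    first_feature = torch.randn(1, 64, 64, 64)        # feature related to the super-resolution task
    second_features = [torch.randn(1, 64, 32, 32),    # 1/8, 1/16 and 1/32 detection features
                       torch.randn(1, 64, 16, 16),
                       torch.randn(1, 64, 8, 8)]
    upsampled = upsample_to(second_features, first_feature.shape[-2:])
    print([tuple(u.shape) for u in upsampled])         # all resized to (1, 64, 64, 64)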
Step S142: and performing feature transformation on the upsampling features by using a first convolution module to obtain semantic features.
The embodiment of step S142 described above is, for example: performing semantic feature transformation on the upsampling features by using a first convolution module to obtain semantic features; the first convolution module may be a convolution module used alone, or a convolution module in a super-resolution model or a target detection model.
Step S143: and fusing the first feature and the semantic feature to obtain a fused feature.
The embodiment of fusing the first feature and the semantic feature in step S143 includes: in a first fusion mode, channel attention fusion is performed on the first feature and the semantic feature; for example, the semantic features transformed from the feature maps of the three sizes 1/8, 1/16 and 1/32 are fused with the first feature along the channel dimension according to the weights output by a channel attention module. In a second fusion mode, spatial attention fusion is performed on the first feature and the semantic feature; for example, the semantic features transformed from the feature maps of the three sizes 1/8, 1/16 and 1/32 are fused with the first feature spatially according to the weights output by a spatial attention module.
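The sketch below shows one possible channel attention fusion of the first feature and a semantic feature of the same size, in the squeeze-and-excitation style; the reduction ratio and the convex-combination form of the fusion are assumptions made for illustration, not the formula prescribed by the patent.

    import torch
    import torch.nn as nn

    class ChannelAttentionFusion(nn.Module):
        """Fuses two same-sized feature maps using per-channel attention weights."""
        def __init__(self, channels=64, reduction=16):
            super().__init__()
            self.weight = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),                          # squeeze spatial dimensions
                nn.Conv2d(2 * channels, channels // reduction, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1),
                nn.Sigmoid(),                                     # one fusion weight per channel
            )

        def forward(self, first_feat, semantic_feat):
            w = self.weight(torch.cat([first_feat, semantic_feat], dim=1))
            # channels dominated by detail information keep the first feature,
            # the rest lean on the semantic feature
            return w * first_feat + (1.0 - w) * semantic_feat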
Step S144: and performing super-resolution reconstruction on the fusion features to obtain a super-resolution image.
The embodiment of step S144 described above is, for example: performing super-resolution reconstruction on the fused feature by pixel reordering (pixel shuffle) to obtain a super-resolution image; one pixel-reordering step achieves a 2x enlargement, and the number of pixel-reordering steps can be chosen according to the magnification that needs to be achieved.
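A minimal pixel-reordering reconstruction head might look as follows, assuming the magnification is a power of two so that each PixelShuffle step contributes a 2x enlargement; channel counts are illustrative.

    import torch
    import torch.nn as nn

    class PixelShuffleReconstructor(nn.Module):
        """Reconstructs a super-resolution image from fused features by pixel reordering."""
        def __init__(self, channels=64, scale=4, out_channels=3):
            super().__init__()
            layers = []
            # scale is assumed to be a power of two; each step enlarges the feature map by 2x
            for _ in range(int(scale).bit_length() - 1):
                layers += [nn.Conv2d(channels, channels * 4, 3, 1, 1),
                           nn.PixelShuffle(2),
                           nn.ReLU(inplace=True)]
            layers.append(nn.Conv2d(channels, out_channels, 3, 1, 1))
            self.net = nn.Sequential(*layers)

        def forward(self, fused):
            return self.net(fused)

    if __name__ == "__main__":
        fused = torch.randn(1, 64, 64, 64)
        sr = PixelShuffleReconstructor(scale=4)(fused)
        print(sr.shape)  # torch.Size([1, 3, 256, 256]): two pixel-shuffle steps give a 4x enlargement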
Optionally, after obtaining the super-resolution image, target detection may be performed according to the super-resolution image, and after step S144, the method may further include:
step S145: and carrying out target detection processing on the super-resolution image to obtain a target detection result.
The embodiment of step S145 above is, for example: performing target detection processing on the super-resolution image by using a target detection network model to obtain a target detection result; the neural network models that can be used as the target detection network model here include, but are not limited to: the RCNN model, the Fast RCNN model, the Faster RCNN model, and the like; both Fast RCNN and Faster RCNN are improvements on RCNN.
In the implementation process, the first feature related to the super-resolution task and the semantic feature obtained after the up-sampling feature is transformed are fused to obtain a fusion feature; performing super-resolution reconstruction on the fusion features to obtain a super-resolution image; therefore, the task of super-resolution reconstruction is effectively completed by utilizing detail information in the first feature and semantic information in the semantic features after the up-sampling feature transformation, and the accuracy of super-resolution reconstruction of the image to be processed is improved. Further, target detection processing is carried out on the super-resolution image to obtain a target detection result; the definition of the super-resolution image is improved by effectively utilizing the fused semantic features, so that the accuracy of target detection is improved.
In a second implementation manner, please refer to fig. 5, which illustrates a schematic process diagram of target detection processing provided in an embodiment of the present application; shapes not explained in the figure have the same meanings as in fig. 2. The first feature is processed, the processed first feature is fused with the second feature to obtain a fused feature, and target detection processing is finally performed on the fused feature. This implementation may include:
step S146: and performing feature transformation on the first features by using a second convolution module to obtain detailed features.
The embodiment of step S146 is, for example: performing detail feature transformation on the first feature by using a second convolution module, and decomposing the first feature related to the super-resolution task into detail features of sizes 1/8, 1/16 and 1/32, corresponding to the second features of the same sizes related to the target detection task; that is, the second feature may likewise include feature maps of sizes 1/8, 1/16 and 1/32. The second convolution module may be a convolution module used alone, or a convolution module in the super-resolution model or the target detection model.
Step S147: and fusing the detail characteristic and the second characteristic to obtain a fused characteristic.
The embodiment of fusing the detail feature and the second feature in step S147 includes: the first fusion mode is that the detail features and the second features with the same size are subjected to channel fusion by using a channel attention module; and in the second fusion mode, the detail features and the second features with the same size are subjected to spatial attention fusion by using a spatial attention module, so that the fusion of the two features is realized.
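For the second fusion mode, a possible spatial attention fusion of the detail features and the second feature of the same size is sketched below; the per-pixel convex combination is an assumed form chosen for clarity rather than the patent's exact operation.

    import torch
    import torch.nn as nn

    class SpatialAttentionFusion(nn.Module):
        """Fuses detail features and detection features using a per-pixel attention map."""
        def __init__(self, channels=256):
            super().__init__()
            self.weight = nn.Sequential(
                nn.Conv2d(2 * channels, channels // 4, 3, 1, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // 4, 1, 3, 1, 1),
                nn.Sigmoid(),   # one fusion weight per spatial location
            )

        def forward(self, detail_feat, detection_feat):
            w = self.weight(torch.cat([detail_feat, detection_feat], dim=1))
            # locations rich in detail lean on the detail feature, the rest on the detection feature
            return w * detail_feat + (1.0 - w) * detection_feat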
Step S148: and carrying out target detection processing on the fusion characteristics to obtain a target detection result.
The target detection network model is a neural network model obtained by training a target detection network by using a training data set; the target detection network model may include: bounding box regression branches and classification branches.
There are many embodiments of the above step S148, including but not limited to the following:
in the first implementation mode, a Path Aggregation Network (PANet) model, originally proposed for instance segmentation, is used to perform target detection processing on the fused feature to obtain the target detection result. The input of the path aggregation network model may be the above three feature maps of sizes 1/8, 1/16 and 1/32; PANet fuses the three feature maps of different sizes through a bottom-up path and a top-down path, and finally outputs the target detection result through a classification branch and a bounding box regression branch, whose specific calculation processes are described in the second implementation manner below. In this implementation, the detail features obtained by transforming the first feature are fused with the second feature related to the target detection task to obtain a fused feature, and target detection processing is then performed on the fused feature to obtain the target detection result; the fused detail features are thereby used effectively for target detection, improving the accuracy of target detection.
In the second implementation mode, a target detection network model comprising a frame regression branch and a classification branch is used for carrying out target detection processing on the fusion characteristics to obtain a target detection result; specific examples thereof include: predicting all candidate frames in the fusion characteristics by using frame regression branches in the target detection network model, wherein the candidate frames represent and predict the region containing the target object in the image to be processed; the classification branch in the target detection network model is used to classify the candidate frame, and the target class of the target object corresponding to the candidate frame and the probability corresponding to the target class are obtained.
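As an illustration of such a detection head, the sketch below predicts per-anchor box offsets with a frame regression branch and per-anchor class probabilities with a classification branch over a fused feature map; the anchor count, class count and output layout are assumptions, not the patent's specification.

    import torch
    import torch.nn as nn

    class DetectionHead(nn.Module):
        """Per-location frame regression and classification branches over a fused feature map."""
        def __init__(self, channels=256, num_anchors=3, num_classes=80):
            super().__init__()
            self.num_classes = num_classes
            # frame regression branch: 4 offsets (x, y, w, h) per anchor -> candidate frames
            self.box_branch = nn.Conv2d(channels, num_anchors * 4, 3, 1, 1)
            # classification branch: class scores per anchor; softmax gives class probabilities
            self.cls_branch = nn.Conv2d(channels, num_anchors * num_classes, 3, 1, 1)

        def forward(self, fused):
            b, _, h, w = fused.shape
            boxes = self.box_branch(fused).view(b, -1, 4, h, w)                  # candidate frame offsets
            logits = self.cls_branch(fused).view(b, -1, self.num_classes, h, w)
            probs = logits.softmax(dim=2)                                        # per-class probabilities
            return boxes, probs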
In the implementation process, firstly, the image to be processed is subjected to feature extraction to obtain image features, then a super-resolution model is used for extracting first features in the image features, and a target detection network model is used for extracting second features in the image features; finally, computer vision processing is carried out on the fusion characteristics of the first characteristics and the second characteristics; in other words, the super-resolution feature and the target detection feature extracted from the task-oriented image feature are subjected to semantic and detail level fusion, and then computer vision processing is performed according to the obtained fusion feature, so that computer vision-related tasks such as super-resolution reconstruction or target detection and the like are performed by fully utilizing the complementarity of the super-resolution feature and the target detection feature on semantic information and detail information, and the effect of performing computer vision processing on the image is effectively improved.
Please refer to fig. 6, which illustrates a schematic structural diagram of an image processing apparatus according to an embodiment of the present application; an embodiment of the present application provides an image processing apparatus 200, including:
a processed image obtaining module 210, configured to obtain an image to be processed.
The image feature obtaining module 220 is configured to perform bottom layer feature extraction on the image to be processed to obtain bottom layer features of the image.
An image feature extraction module 230, configured to extract a first feature of the image features using the super-resolution model, and extract a second feature of the image features using the target detection network model.
And the fusion visual processing module 240 is configured to perform semantic and detail level fusion on the first feature and the second feature to obtain a fusion feature, and perform computer visual processing on the fusion feature.
Optionally, in an embodiment of the present application, the computer vision processing includes: reconstructing super-resolution; a fusion vision processing module comprising:
and the sampling characteristic obtaining module is used for performing up-sampling on the second characteristic to obtain an up-sampling characteristic.
And the semantic feature obtaining module is used for performing feature transformation on the up-sampling features by using the first convolution module to obtain the semantic features.
And the first feature fusion module is used for fusing the first feature and the semantic feature to obtain a fusion feature.
And the super-resolution reconstruction module is used for performing super-resolution reconstruction on the fusion features to obtain a super-resolution image.
Optionally, in an embodiment of the present application, the first feature fusion module includes:
and the semantic feature fusion module is used for performing channel attention fusion on the first feature and the semantic feature, or performing space attention fusion on the first feature and the semantic feature.
Optionally, in an embodiment of the present application, the fusion vision processing module further includes:
and the first target detection module is used for carrying out target detection processing on the super-resolution image to obtain a target detection result.
Optionally, in an embodiment of the present application, the computer vision processing includes: target detection processing; a fusion vision processing module comprising:
and the detail feature obtaining module is used for performing feature transformation on the first features by using the second convolution module to obtain the detail features.
And the second feature fusion module is used for fusing the detail features and the second features to obtain fusion features.
And the second target detection module is used for carrying out target detection processing on the fusion characteristics to obtain a target detection result.
Optionally, in an embodiment of the present application, the second feature fusion module includes:
and the detail feature fusion module is used for performing channel attention fusion on the detail features and the second features or performing spatial attention fusion on the detail features and the second features.
Optionally, in this embodiment of the present application, the target detection network model includes: frame regression branches and classification branches; a second target detection module comprising:
and the candidate frame prediction module is used for predicting all candidate frames in the fusion characteristics by using frame regression branches, and the candidate frame representation predicts the region containing the target object in the image to be processed.
And the candidate frame classification module is used for classifying the candidate frames by using the classification branches to obtain the target classes of the target objects corresponding to the candidate frames and the probabilities corresponding to the target classes.
It should be understood that the apparatus corresponds to the above embodiment of the image processing method and can perform the steps involved in the method embodiment; the specific functions of the apparatus can be referred to in the description above, and a detailed description is omitted here where appropriate to avoid redundancy. The apparatus includes at least one software functional module that can be stored in memory in the form of software or firmware or built into the operating system (OS) of the apparatus.
Please refer to fig. 7 for a schematic structural diagram of an electronic device according to an embodiment of the present application. An electronic device 300 provided in an embodiment of the present application includes: a processor 310 and a memory 320, the memory 320 storing machine readable instructions executable by the processor 310, the machine readable instructions when executed by the processor 310 performing the method as above.
The embodiment of the present application further provides a storage medium 330, where the storage medium 330 stores thereon a computer program, and the computer program is executed by the processor 310 to perform the method as above.
The storage medium 330 may be implemented by any type of volatile or nonvolatile storage device or a combination thereof, such as a Static Random Access Memory (SRAM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), an Erasable Programmable Read-Only Memory (EPROM), a Programmable Read-Only Memory (PROM), a Read-Only Memory (ROM), a magnetic memory, a flash memory, a magnetic disk or an optical disk.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules of the embodiments in the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above description is only an alternative embodiment of the embodiments of the present application, but the scope of the embodiments of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the embodiments of the present application, and all the changes or substitutions should be covered by the scope of the embodiments of the present application.

Claims (8)

1. An image processing method, comprising:
obtaining an image to be processed;
extracting the features of the image to be processed to obtain image features;
extracting a first feature of the image features using a super-resolution model and a second feature of the image features using a target detection network model;
fusing the first feature and the second feature to obtain a fused feature, and performing computer vision processing on the fused feature;
the super-resolution model adopts a residual dense module in a residual, the first feature and the second feature are fused to obtain a fused feature, and the fused feature is subjected to computer vision processing, and the method comprises the following steps: performing feature transformation on the second features by using a first convolution module to obtain semantic features; fusing the first feature and the semantic feature to obtain a fused feature; performing super-resolution reconstruction on the fusion features to obtain a super-resolution image;
the fusing the first feature and the second feature to obtain a fused feature, and performing computer vision processing on the fused feature, including: performing feature transformation on the first features by using a second convolution module to obtain detail features with the same size; fusing the detail features with the same size and the second features to obtain fused features; and carrying out target detection processing on the fusion characteristics to obtain a target detection result.
2. The method of claim 1, wherein fusing the first feature and the semantic feature comprises:
and performing channel fusion on the first feature and the semantic feature, or performing space attention fusion on the first feature and the semantic feature.
3. The method of claim 1, further comprising, after said obtaining the super-resolution image:
and carrying out target detection processing on the super-resolution image to obtain a target detection result.
4. The method of claim 3, wherein said fusing the detail feature and the second feature comprises:
and performing channel fusion on the detail feature and the second feature, or performing spatial attention fusion on the detail feature and the second feature.
5. The method of claim 1, wherein the target detection network model comprises: frame regression branches and classification branches; the target detection processing on the fusion features comprises:
predicting all candidate frames in the fused feature by using the frame regression branch, wherein each candidate frame represents a predicted region containing a target object in the image to be processed;
and classifying the candidate frame by using the classification branch to obtain a target class of the target object corresponding to the candidate frame and the probability corresponding to the target class.
6. An image processing apparatus characterized by comprising:
the processing image obtaining module is used for obtaining an image to be processed;
the image characteristic obtaining module is used for extracting the characteristics of the image to be processed to obtain image characteristics;
the image feature extraction module is used for extracting a first feature from the image features by using a super-resolution model and extracting a second feature from the image features by using a target detection network model;
the fusion visual processing module is used for fusing the first feature and the second feature to obtain a fusion feature and performing computer visual processing on the fusion feature;
the super-resolution model adopts a residual-in-residual dense block; the fusing of the first feature and the second feature to obtain a fused feature and the performing of computer vision processing on the fused feature comprise the following steps: performing feature transformation on the second feature by using a first convolution module to obtain a semantic feature; fusing the first feature and the semantic feature to obtain a fused feature; and performing super-resolution reconstruction on the fused feature to obtain a super-resolution image;
the fusing the first feature and the second feature to obtain a fused feature, and performing computer vision processing on the fused feature, including: performing feature transformation on the first features by using a second convolution module to obtain detail features with the same size; fusing the detail features with the same size and the second features to obtain fused features; and carrying out target detection processing on the fusion characteristics to obtain a target detection result.
7. An electronic device, comprising: a processor and a memory, the memory storing machine-readable instructions executable by the processor, wherein the machine-readable instructions, when executed by the processor, perform the method of any one of claims 1 to 5.
8. A storage medium, having stored thereon a computer program which, when executed by a processor, performs the method of any one of claims 1 to 5.
CN202011136482.5A 2020-10-22 2020-10-22 Image processing method and device, electronic equipment and storage medium Active CN111968064B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011136482.5A CN111968064B (en) 2020-10-22 2020-10-22 Image processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011136482.5A CN111968064B (en) 2020-10-22 2020-10-22 Image processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111968064A CN111968064A (en) 2020-11-20
CN111968064B (en) 2021-01-15

Family

ID=73387643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011136482.5A Active CN111968064B (en) 2020-10-22 2020-10-22 Image processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111968064B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220261959A1 (en) * 2021-02-08 2022-08-18 Nanjing University Of Posts And Telecommunications Method of reconstruction of super-resolution of video frame

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112562819B (en) * 2020-12-10 2022-06-17 清华大学 Report generation method of ultrasonic multi-section data for congenital heart disease
CN113421305B (en) * 2021-06-29 2023-06-02 上海高德威智能交通系统有限公司 Target detection method, device, system, electronic equipment and storage medium
CN115880192A (en) * 2021-09-27 2023-03-31 北京字跳网络技术有限公司 Feature fusion method, image defogging method and device
TWI806243B (en) 2021-11-17 2023-06-21 瑞昱半導體股份有限公司 Super resolution image generating device
CN114283357B (en) * 2021-12-08 2024-08-20 北京三快在线科技有限公司 Vehicle detection method and device, storage medium and electronic equipment
CN114418857A (en) * 2022-01-25 2022-04-29 Oppo广东移动通信有限公司 Image display method and device, head-mounted display equipment and storage medium
CN114694091B (en) * 2022-03-09 2024-06-14 东南大学 Traffic participant detection method based on surveillance video under complex traffic environment
CN115937145B (en) * 2022-12-09 2024-03-19 深圳市禾葡兰信息科技有限公司 Skin health visualization method, device and equipment based on big data analysis

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109993096A (en) * 2019-03-26 2019-07-09 东北大学 Optical flow multi-layer frame feature propagation and aggregation method for video object detection
WO2020015167A1 (en) * 2018-07-17 2020-01-23 西安交通大学 Image super-resolution and non-uniform blur removal method based on fusion network
CN110728658A (en) * 2019-09-16 2020-01-24 武汉大学 High-resolution remote sensing image weak target detection method based on deep learning
CN111681168A (en) * 2020-06-05 2020-09-18 杭州电子科技大学 Low-resolution cell super-resolution reconstruction method based on parallel residual network
CN111783576A (en) * 2020-06-18 2020-10-16 西安电子科技大学 Pedestrian re-identification method based on improved YOLOv3 network and feature fusion
CN111797920A (en) * 2020-06-30 2020-10-20 武汉大学 Deep network remote sensing extraction method and system for impervious surfaces with gated feature fusion

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9949714B2 (en) * 2015-07-29 2018-04-24 Htc Corporation Method, electronic apparatus, and computer readable medium of constructing classifier for disease detection
CN111275066A (en) * 2018-12-05 2020-06-12 北京嘀嘀无限科技发展有限公司 Image feature fusion method and device and electronic equipment
CN111369440B (en) * 2020-03-03 2024-01-30 网易(杭州)网络有限公司 Model training and image super-resolution processing method, device, terminal and storage medium
CN111768342B (en) * 2020-09-03 2020-12-01 之江实验室 Human face super-resolution method based on attention mechanism and multi-stage feedback supervision


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Infrared and Visible Image Fusion with a Generative Adversarial Network and a Residual Network; Dongdong Xu et al.; Applied Sciences; 2020-01-11; pp. 1-22 *
Recognition and counting of pests in sticky trap images based on deep learning; 张银松; China Master's Theses Full-text Database, Agricultural Science and Technology; 2019-12-15 (No. 12); D046-3 *


Also Published As

Publication number Publication date
CN111968064A (en) 2020-11-20

Similar Documents

Publication Publication Date Title
CN111968064B (en) Image processing method and device, electronic equipment and storage medium
CN112465828B (en) Image semantic segmentation method and device, electronic equipment and storage medium
US11176381B2 (en) Video object segmentation by reference-guided mask propagation
CN111369581B (en) Image processing method, device, equipment and storage medium
Yang et al. A self-learning approach to single image super-resolution
CN113221925B (en) Target detection method and device based on multi-scale image
CN110781980B (en) Training method of target detection model, target detection method and device
CN109977832B (en) Image processing method, device and storage medium
AU2021354030B2 (en) Processing images using self-attention based neural networks
CN111914654A (en) Text layout analysis method, device, equipment and medium
US20240161304A1 (en) Systems and methods for processing images
US20220076119A1 (en) Device and method of training a generative neural network
CN112597918A (en) Text detection method and device, electronic equipment and storage medium
WO2020043296A1 (en) Device and method for separating a picture into foreground and background using deep learning
Cui et al. Exploring resolution and degradation clues as self-supervised signal for low quality object detection
Rossi et al. Neural network for denoising and reading degraded license plates
CN114170231A (en) Image semantic segmentation method and device based on convolutional neural network and electronic equipment
CN116612280A (en) Vehicle segmentation method, device, computer equipment and computer readable storage medium
CN111929688B (en) Method and equipment for determining radar echo prediction frame sequence
CN117693768A (en) Semantic segmentation model optimization method and device
CN116246064A (en) Multi-scale space feature enhancement method and device
Schirrmacher et al. SR 2: Super-resolution with structure-aware reconstruction
CN115272906A (en) Video background portrait segmentation model and algorithm based on point rendering
Sharma et al. Multilevel progressive recursive dilated networks with correlation filter (MPRDNCF) for image super-resolution
Zini et al. On the impact of rain over semantic segmentation of street scenes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant