WO2023060746A1 - Small image multi-object detection method based on super-resolution - Google Patents

Small image multi-object detection method based on super-resolution Download PDF

Info

Publication number
WO2023060746A1
Authority
WO
WIPO (PCT)
Prior art keywords
resolution
resolution image
image
super
target
Prior art date
Application number
PCT/CN2021/138098
Other languages
French (fr)
Chinese (zh)
Inventor
Qin Wenjian (秦文健)
Gao Shuaiqiang (高帅强)
Original Assignee
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
Publication of WO2023060746A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4053Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4046Scaling of whole images or parts thereof, e.g. expanding or contracting using neural networks

Definitions

  • the present invention relates to the technical field of natural image processing, and more specifically, to a super-resolution-based multi-target detection method for small images.
  • blind-guiding technology based on deep target detection usually uploads the collected images to a server, trains a network with supervised or semi-supervised methods to process them, and then combines the results with other sensor information to guide the blind.
  • This type of method makes full use of the advantages of deep learning in processing complex images and performs very well in general blind-guiding scenarios.
  • through deep learning, blind-guiding equipment can fairly accurately identify common objects in blind people's daily scenes, such as trash cans, chairs, and people.
  • the detection results of such methods are not satisfactory.
  • current super-resolution techniques generally learn the correspondence between low-resolution and high-resolution images and are divided into image super-resolution, feature-map super-resolution, and target super-resolution: a low-resolution image or feature map is taken as input, a high-resolution image or feature map is output, and the output is compared with the real high-resolution image or feature map.
  • Existing image object detection is usually divided into two categories: one is a two-stage detector, such as Faster R-CNN.
  • the other is a one-stage detector, such as YOLO, SSD.
  • Two-stage detectors have high localization and object recognition accuracy, while one-stage detectors have high inference speed.
  • the existing high-performance target detection algorithm takes a high-resolution image as input and outputs the coordinates and category of the target.
  • the obstacle detection methods of blind guide equipment are divided into traditional non-vision, traditional machine vision and machine vision methods based on deep learning.
  • Traditional non-vision methods use only ultrasonic and infrared sensors; their judgment of obstacles is limited to direction and distance, and their accuracy is low.
  • Traditional machine vision mainly uses pre-written algorithms to recognize target features in images; this approach transfers poorly to new scenes and is not intelligent.
  • the machine vision method based on deep learning learns image features from training datasets, can recognize images of various scenes, and performs target detection with very good results, but it requires high-resolution image acquisition equipment and high-performance transmission and processing hardware. In a wearable blind-guiding scenario, image acquisition and processing must account for power consumption, volume, and weight, and because a low-resolution image contains very little object information, this method struggles to detect obstacles effectively.
  • the purpose of the present invention is to overcome the above defects of the prior art and provide a super-resolution-based method for multi-target detection in small images, which includes: acquiring a first-resolution image of the original scene; using a reversible neural network model to convert the first-resolution image into a second-resolution image for transmission and then restoring it to the first-resolution image, where the resolution of the second-resolution image is lower than that of the first-resolution image; feeding the restored first-resolution image into a trained super-resolution diffusion model, which performs super-resolution reconstruction through a stochastic iterative denoising process and outputs an ultra-high-resolution image; and performing target detection on the ultra-high-resolution image to obtain target recognition information.
  • compared with the prior art, the present invention introduces a super-resolution structure into the blind-guiding auxiliary detection pipeline to enrich image information, and introduces a diffusion probability model that adds the characteristics of high-resolution images, improving obstacle detection accuracy in low-resolution scenarios.
  • Fig. 1 is a flowchart of the super-resolution-based small-image multi-target detection method according to one embodiment of the present invention.
  • Fig. 2 is a schematic diagram of the spatial structure of the super-resolution-based small-image multi-target detection method according to one embodiment of the present invention.
  • Fig. 3 is a network structure diagram of the image scaling module according to one embodiment of the present invention.
  • Fig. 4 is a network structure diagram of the super-resolution module according to one embodiment of the present invention.
  • Fig. 5 is a schematic diagram of the target detection module according to one embodiment of the present invention.
  • the super-resolution-based small-image multi-target detection method provided by the present invention generally includes image acquisition, image scaling, super-resolution (that is, reconstructing a corresponding high-resolution image from a low-resolution image), target detection, and post-processing.
  • the provided super-resolution-based small image multi-target detection method includes the following steps:
  • Step S110 acquiring an original scene image.
  • the original image of the scene is acquired by the camera in the headset and passed to the image scaling module. While acquiring the image, the device records location and status information, such as its height and inclination, so that this can later be processed together with the target location information into cues the blind user can perceive.
  • Step S120 reducing the resolution of the original image, and transmitting the reduced resolution image to the server to restore the original resolution.
  • the original image is input to the scaling module, which outputs a low-resolution image and latent variables; both are transmitted to the server, where the server-side scaling module restores them to the original resolution.
  • Normalizing flow is a powerful generative probabilistic model; here, a reversible neural network learns both the downscaling and upscaling used for image rescaling.
  • Reversible neural networks are used to implement the mapping of implicit parameters to measurable values, which is called the forward process.
  • the reverse process is to obtain the implicit parameters according to the measured values. Since the reversible neural network model is bijective, it can recover high-resolution images with high accuracy after downscaling.
  • the image scaling process is shown in Figure 2 and comprises M1, M2, and M3, where the structure of M1 is shown in Figure 3, M2 is a convolutional feature-extraction network, and M3 consists of P flow-steps, each containing an activation normalization layer (act-norm), a 1×1 convolutional layer (1×1 conv), and an affine coupling layer; y denotes the reduced-resolution image and a denotes the intermediate feature layer.
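The affine coupling layer is what makes each flow-step exactly invertible. A minimal sketch in plain Python follows; the scale and translation branches `s` and `t` are toy stand-ins for the small networks used in practice (any choice of `s` and `t` preserves invertibility, because they only ever read the untouched half of the input):

```python
import math

def s(h):  # scale branch (placeholder for a small conv network)
    return [math.tanh(v) for v in h]

def t(h):  # translation branch (placeholder for a small conv network)
    return [0.5 * v for v in h]

def coupling_forward(x):
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    # first half passes through unchanged; second half is scaled and shifted
    y2 = [b * math.exp(si) + ti for b, si, ti in zip(x2, s(x1), t(x1))]
    return x1 + y2

def coupling_inverse(y):
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    # y1 == x1, so s(y1) and t(y1) are the same values used in the forward pass
    x2 = [(b - ti) * math.exp(-si) for b, si, ti in zip(y2, s(y1), t(y1))]
    return y1 + x2

x = [0.3, -1.2, 0.8, 2.0]
x_rec = coupling_inverse(coupling_forward(x))
err = max(abs(a - b) for a, b in zip(x, x_rec))
```

This bijectivity is what lets the server-side module recover the original resolution from the low-resolution image and latent variables with high accuracy.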
  • the loss function for training the reversible neural network takes the form L = λ1·ℓ(y*, y) + λ2·ℓ(x, x̂) + λ3·ℓz(z), where:
  • x is the original-resolution input;
  • y is the low-resolution output;
  • z is the latent-variable output;
  • x̂ is the high-resolution image restored from y and z;
  • y* is the low-resolution image obtained from x by bicubic interpolation;
  • ℓ(y*, y) is the pixel loss between y* and y, ℓ(x, x̂) is the pixel loss between x and x̂, and ℓz(z) is the term constraining the latent variable z;
  • λ1, λ2, λ3 are the weights of the corresponding terms.
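Assuming L2 pixel losses and a simple quadratic term pushing the latent variable toward a standard normal prior (the exact norms are not spelled out here), the three-term loss can be sketched as:

```python
def l2(a, b):
    # mean squared error between two flat pixel lists
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

def rescaling_loss(x, x_rec, y, y_star, z, lam1=1.0, lam2=1.0, lam3=1.0):
    guide = l2(y, y_star)                    # low-res output vs. bicubic y*
    recon = l2(x, x_rec)                     # restored image vs. original x
    latent = sum(v * v for v in z) / len(z)  # assumed regularizer toward N(0, I)
    return lam1 * guide + lam2 * recon + lam3 * latent

# toy usage with flat lists standing in for images
loss = rescaling_loss(x=[0.0], x_rec=[1.0], y=[0.0], y_star=[0.0], z=[0.0])
```

The weights lam1–lam3 correspond to λ1–λ3 above; their values are illustrative defaults, not values taken from the patent.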
  • the image scaling module scales the image to its original size.
  • Step S130 performing super-resolution reconstruction on the restored image to obtain a super-resolution image.
  • the restored output image is super-resolved by a factor of 16 to the high-resolution size using the super-resolution diffusion model; the denoising diffusion probability model performs super-resolution through a stochastic iterative denoising process.
  • the super-resolution model SR3 (Super-Resolution via Repeated Refinement), a conditional denoising diffusion probability model, is used for image super-resolution reconstruction.
  • its working principle is to learn to transform a standard normal distribution into the empirical data distribution through a series of refinement steps.
  • the super-resolution network structure is shown in Figure 4; it uses the U-Net architecture, trained with a denoising objective to iteratively remove noise of various levels from the output.
  • the conditional diffusion probabilistic denoising model generates the target image y 0 in T refinement steps.
  • the model starts from a pure-noise image y T ~ N(0, I) and iteratively refines it through learned conditional transition distributions pθ(y t-1 | y t , x).
  • the distribution of intermediate images in the inference chain is defined according to a forward diffusion process that gradually adds Gaussian noise to the signal via a fixed Markov chain, denoted q(y t | y t-1 ).
  • the goal of the model is to reverse the Gaussian diffusion process by iteratively recovering the signal from the noise via a reverse Markov chain conditioned on x (the low-resolution image).
  • the inverse chain is learned using a denoising model f ⁇ that takes as input a source image and a noisy target image and estimates the noise.
  • the training objective function is set as, for example: E(x,y) Eγ,ϵ ‖ fθ(x, √γ·y0 + √(1−γ)·ϵ, γ) − ϵ ‖ᵖ, with ϵ ~ N(0, I), where:
  • x represents a low-resolution image;
  • y represents a high-resolution image;
  • (x, y) is sampled from the training data set;
  • y 0 represents the original high-resolution image;
  • γ represents the noise scale, sampled from its distribution p(γ);
  • the norm exponent p ∈ {1, 2}: when p is 1, the loss is the L1 loss, and when p is 2, it is the squared L2 loss;
  • T is the total number of diffusion steps;
  • t is the index of the diffusion step;
  • fθ is the conditional diffusion probability denoising model;
  • αt is a hyperparameter with value range 0 < αt < 1, which determines the variance of the noise added in each iteration.
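A toy sketch of one training step for an SR3-style objective of this form; the denoiser `f_theta` is a stub returning zeros (in practice it is the conditioned U-Net of Figure 4), and `alphas` are the per-step noise hyperparameters, so this illustrates the sampling and loss computation rather than the patent's exact network:

```python
import math
import random

def f_theta(x_lowres, y_noisy, gamma):
    # stub noise predictor standing in for the conditioned U-Net
    return [0.0] * len(y_noisy)

def training_step(x_lowres, y0, alphas, p=2):
    t = random.randrange(len(alphas))      # pick a random diffusion step
    gamma = 1.0
    for a in alphas[: t + 1]:              # gamma_t = product of alpha_1..alpha_t
        gamma *= a
    eps = [random.gauss(0.0, 1.0) for _ in y0]
    # corrupt the high-res target: sqrt(gamma)*y0 + sqrt(1-gamma)*eps
    y_noisy = [math.sqrt(gamma) * yi + math.sqrt(1.0 - gamma) * ei
               for yi, ei in zip(y0, eps)]
    eps_hat = f_theta(x_lowres, y_noisy, gamma)  # predict the added noise
    # Lp loss between true and predicted noise (p in {1, 2})
    return sum(abs(e - eh) ** p for e, eh in zip(eps, eps_hat)) / len(eps)

random.seed(0)
loss = training_step(x_lowres=[0.1] * 4, y0=[0.5] * 4, alphas=[0.98] * 10)
```

In a real implementation the loss would then be backpropagated to update the U-Net weights, as described for formula 2.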
  • Step S140 detecting the category and position of the target based on the ultra-high-resolution image.
  • the ultra-high resolution image is input into the target detector, and the category and coordinate information of the target are output.
  • feature pyramids are used to achieve multi-scale object detection.
  • Feature pyramids are a fundamental building block in multi-scale object detection.
  • although high-level features contain rich semantic information, their low resolution makes it difficult to preserve object location information accurately.
  • although low-level features carry less semantic information, their high resolution preserves object location information accurately.
  • the target detection module interpolates the ultra-low-resolution image, concatenates it with the high-resolution image, and feeds both to the feature extraction module; the resulting detections are sorted by weight.
  • Step S150 fusing the target information with the device status information and transforming it into perceivable information.
  • the target information is fused with the device status information by using the post-processing module, and converted into information that blind people can feel.
  • experiment setup is as follows:
  • the low-resolution images of shape (256, 3, 8, 8) are upsampled by a factor of 16 via transposed convolution to (256, 3, 128, 128), then concatenated with the noise images into (256, 6, 128, 128) as the network input.
  • the network loss is obtained by formula 2, and then the gradient is calculated and backpropagated to update the network weights.
  • the inference process is: concatenate the interpolated low-resolution image x with y T and obtain y T-1 from formula 3; similarly, obtain y T-2 from x and y T-1 ; after T iterations, y 0 is obtained.
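The iterative refinement above can be sketched as a standard DDPM-style reverse loop conditioned on the low-resolution image. The update rule below is the usual posterior-mean step and `f_theta` is again a stub noise predictor, so this illustrates the control flow, not the patent's exact formula 3:

```python
import math
import random

def f_theta(x_lowres, y, gamma):
    # stub noise predictor standing in for the conditioned U-Net
    return [0.0] * len(y)

def refine(x_lowres, alphas):
    y = [random.gauss(0.0, 1.0) for _ in x_lowres]  # y_T ~ N(0, I)
    gammas, g = [], 1.0
    for a in alphas:                                # cumulative products gamma_t
        g *= a
        gammas.append(g)
    for t in range(len(alphas) - 1, -1, -1):        # steps T ... 1
        a_t, g_t = alphas[t], gammas[t]
        eps_hat = f_theta(x_lowres, y, g_t)
        # posterior-mean update: remove the predicted noise for this step
        y = [(yi - (1.0 - a_t) / math.sqrt(1.0 - g_t) * ei) / math.sqrt(a_t)
             for yi, ei in zip(y, eps_hat)]
        if t > 0:                                   # add fresh noise except at t = 0
            sigma = math.sqrt(1.0 - a_t)
            y = [yi + sigma * random.gauss(0.0, 1.0) for yi in y]
    return y                                        # y_0, the super-resolved estimate

random.seed(0)
y0 = refine(x_lowres=[0.0] * 4, alphas=[0.95] * 5)
```

Each iteration would, in practice, concatenate the interpolated low-resolution image with the current estimate as the network input, as described above.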
  • the interpolated low-resolution image x and y 0 are concatenated and input into the target detector to obtain two sets of target positions and categories; after weighted sorting, non-maximum suppression is applied to obtain the final result.
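The weighted sorting followed by non-maximum suppression can be sketched as follows; the per-source weights `w_sr` and `w_lr` and the IoU threshold are illustrative assumptions, not values from the patent:

```python
def iou(a, b):
    # intersection-over-union of two (x1, y1, x2, y2) boxes
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def fuse_and_nms(dets_sr, dets_lr, w_sr=1.0, w_lr=0.5, thresh=0.5):
    # each detection is (box, score, class); weight the two sources, then sort
    pool = [(b, s * w_sr, c) for b, s, c in dets_sr]
    pool += [(b, s * w_lr, c) for b, s, c in dets_lr]
    pool.sort(key=lambda d: d[1], reverse=True)
    keep = []
    for box, score, cls in pool:
        # greedy NMS: drop a box that overlaps a kept box of the same class
        if all(kc != cls or iou(box, kb) < thresh for kb, _, kc in keep):
            keep.append((box, score, cls))
    return keep

sr = [((0, 0, 10, 10), 0.9, "chair")]
lr = [((1, 1, 10, 10), 0.8, "chair"), ((50, 50, 60, 60), 0.7, "person")]
result = fuse_and_nms(sr, lr)
```

Here the duplicate chair box from the low-resolution branch is suppressed, while the non-overlapping person detection survives.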
  • the present invention super-resolves the low-resolution image with the diffusion probability model, achieving 16-fold upscaling from an ultra-low-resolution image (e.g., as small as 8×8 pixels) to a high-resolution image (e.g., 128×128 pixels);
  • the high-resolution image is then processed by the target detection module, which solves the poor robustness and low accuracy of target detection in low-resolution scenarios faced by blind-guiding technology, and reduces the power consumption of the device.
  • the present invention designs a super-resolution-based small-image multi-target detection method, which addresses the degradation of obstacle detection in blind-guiding technology under ultra-low-resolution scenarios. Image scaling technology downscales the original image to a low-resolution image for low-cost transmission and then restores it to a high-quality original image; diffusion-probability-model-based image super-resolution then enables target detection on low-resolution images of blind people's daily scenes, providing a solution for existing blind-guiding technology; at the same time, both low-resolution and high-resolution image information are used to improve detection accuracy.
  • the present invention uses lower-resolution images as the original input, so the blind-guiding device can use a low-resolution camera; applying image scaling technology reduces the amount of data transmitted, lowering power consumption and device volume during data transmission, so that the device can work for a long time and the burden on users is reduced.
  • the present invention can be a system, method and/or computer program product.
  • a computer program product may include a computer readable storage medium having computer readable program instructions thereon for causing a processor to implement various aspects of the present invention.
  • a computer readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device.
  • a computer readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • A non-exhaustive list of computer-readable storage media includes: portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in grooves with instructions stored thereon, and any suitable combination of the above.
  • computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., pulses of light through fiber optic cables), or transmitted electrical signals.
  • Computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or downloaded to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network.
  • the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
  • a network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within that device.
  • Computer program instructions for carrying out operations of the present invention may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented languages such as Smalltalk, C++, or Python, and conventional procedural languages such as the "C" language or similar programming languages.
  • Computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or the connection can be made to an external computer (for example, through the Internet using an Internet service provider).
  • in some embodiments, an electronic circuit, such as a programmable logic circuit, field-programmable gate array (FPGA), or programmable logic array (PLA), can execute the computer-readable program instructions to implement aspects of the present invention.
  • These computer-readable program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create an apparatus for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • These computer-readable program instructions may also be stored in a computer-readable storage medium and cause computers, programmable data processing apparatuses, and/or other devices to work in a specific way, so that the computer-readable medium storing the instructions comprises an article of manufacture including instructions that implement aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • each block in a flowchart or block diagram may represent a module, a program segment, or a portion of instructions, which comprises one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, implementation by software, and implementation by a combination of software and hardware are all equivalent.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

A small image multi-object detection method based on super-resolution. The method comprises: acquiring a first-resolution image of an original scene (S110); converting the first-resolution image into a second-resolution image by using a reversible neural network model, then transmitting the second-resolution image, and then restoring the second-resolution image to the first-resolution image, wherein the resolution of the second-resolution image is lower than that of the first-resolution image (S120); inputting the first-resolution image obtained after restoration into a trained super-resolution diffusion model, executing super-resolution reconstruction by means of a random iterative denoising process, and outputting an ultra-high-resolution image (S130); and executing object detection on the ultra-high-resolution image, so as to obtain object identification information (S140). By means of the method, the obstacle detection precision in a low-resolution scenario is improved, and a blind guidance device can operate for a long time, thereby alleviating the burden of a user.

Description

A Method of Multi-Target Detection in Small Images Based on Super-Resolution

Technical Field

The present invention relates to the technical field of natural image processing and, more specifically, to a super-resolution-based multi-target detection method for small images.

Background Art
At present, travel involves many inconveniences for the visually impaired. Intelligent blind-guiding design not only helps them identify obstacles better when traveling, but also brings great convenience to their daily lives. With the explosion of artificial intelligence, deep learning and convolutional neural networks have enabled computer vision to gradually supersede traditional blind-guiding technology that relies on ultrasound and similar sensors for obstacle avoidance, solving obstacle detection problems that were previously complex and hard to handle.

In the existing technology, blind-guiding based on deep target detection usually uploads the collected images to a server, trains a network with supervised or semi-supervised methods to process them, and then combines the results with other sensor information to guide the blind. This type of method makes full use of the advantages of deep learning in processing complex images and performs very well in general blind-guiding scenarios. Through deep learning, blind-guiding equipment can fairly accurately identify common objects in blind people's daily scenes, such as trash cans, chairs, and people. For low-resolution scenes, however, the detection results of such methods are unsatisfactory. Most vision-based blind-guiding technologies train networks on high-resolution color images, but device constraints make it difficult to collect high-resolution image information, or detection on high-resolution images demands substantial computing power and time. In low-resolution scenes, the target features of an image lose much of their effectiveness: the image contains very little information, and object contours and categories are hard to identify.

Current super-resolution techniques generally learn the correspondence between low-resolution and high-resolution images and are divided into image super-resolution, feature-map super-resolution, and target super-resolution: a low-resolution image or feature map is taken as input, a high-resolution image or feature map is output, and the output is compared with the real high-resolution image or feature map.

Existing image target detection is usually divided into two categories: two-stage detectors, such as Faster R-CNN, and one-stage detectors, such as YOLO and SSD. Two-stage detectors have higher localization and object recognition accuracy, while one-stage detectors have higher inference speed. Existing high-performance target detection algorithms take a high-resolution image as input and output the coordinates and categories of the targets.

In general, the obstacle detection methods of blind-guiding equipment are divided into traditional non-vision methods, traditional machine vision, and deep-learning-based machine vision. Traditional non-vision methods use only ultrasonic and infrared sensors; their judgment of obstacles is limited to direction and distance, and their accuracy is low. Traditional machine vision mainly uses pre-written algorithms to recognize target features in images; this approach transfers poorly and is not intelligent. Deep-learning-based machine vision learns image features from training datasets, can recognize images of various scenes, and performs target detection with very good results, but it requires high-resolution image acquisition equipment and high-performance transmission and processing hardware. In a wearable blind-guiding scenario, image acquisition and processing must account for power consumption, volume, and weight, and because a low-resolution image contains very little object information, this method struggles to detect obstacles effectively.
Summary of the Invention

The purpose of the present invention is to overcome the above defects of the prior art and provide a super-resolution-based method for multi-target detection in small images, which includes: acquiring a first-resolution image of the original scene; using a reversible neural network model to convert the first-resolution image into a second-resolution image for transmission and then restoring it to the first-resolution image, where the resolution of the second-resolution image is lower than that of the first-resolution image; feeding the restored first-resolution image into a trained super-resolution diffusion model, which performs super-resolution reconstruction through a stochastic iterative denoising process and outputs an ultra-high-resolution image; and performing target detection on the ultra-high-resolution image to obtain target recognition information.

Compared with the prior art, the present invention introduces a super-resolution structure into the blind-guiding auxiliary detection pipeline to enrich image information, and introduces a diffusion probability model that adds the characteristics of high-resolution images, improving obstacle detection accuracy in low-resolution scenarios.

Other features and advantages of the present invention will become apparent from the following detailed description of exemplary embodiments with reference to the accompanying drawings.
Brief Description of the Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and, together with the description, serve to explain its principles.

Fig. 1 is a flowchart of the super-resolution-based small-image multi-target detection method according to one embodiment of the present invention;

Fig. 2 is a schematic diagram of the spatial structure of the super-resolution-based small-image multi-target detection method according to one embodiment of the present invention;

Fig. 3 is a network structure diagram of the image scaling module according to one embodiment of the present invention;

Fig. 4 is a network structure diagram of the super-resolution module according to one embodiment of the present invention;

Fig. 5 is a schematic diagram of the target detection module according to one embodiment of the present invention.
Detailed Description of Embodiments
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that, unless otherwise specified, the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present invention.
The following description of at least one exemplary embodiment is merely illustrative and in no way limits the invention, its application, or its uses.
Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate such techniques, methods, and devices should be regarded as part of the specification.
In all examples shown and discussed herein, any specific value should be construed as merely exemplary rather than limiting; other examples of the exemplary embodiments may therefore have different values.
It should be noted that similar reference numerals and letters denote similar items in the following figures; once an item has been defined in one figure, it need not be discussed further in subsequent figures.
The super-resolution-based small-image multi-target detection method provided by the present invention comprises, as a whole, image acquisition, image scaling, super-resolution (i.e., reconstructing a corresponding high-resolution image from a low-resolution image), target detection, and post-processing.
Specifically, as shown in Fig. 1 and Fig. 2, the provided super-resolution-based small-image multi-target detection method includes the following steps.
Step S110: acquire an original scene image.
For example, the original image of the scene is captured by a camera in a head-mounted device and passed to the image scaling module. While the image is acquired, position and state information of the device, such as its height and inclination, is recorded so that it can later be combined with the target position information and converted into information perceivable by a blind user.
Step S120: reduce the resolution of the original image, transmit the reduced-resolution image to a server, and restore it to the original resolution.
In this step, the original image is input into the scaling module, which outputs a low-resolution image together with latent variables; both are transmitted to the server, whose scaling module restores the low-resolution image and the latent variables to the original resolution. Reducing the image resolution lowers bandwidth and latency, and thereby reduces transmission cost.
For example, normalizing flows are powerful generative probabilistic models; an invertible neural network is used to learn both the downscaling and the upscaling for image rescaling. The invertible neural network implements a mapping from latent parameters to measurable values, called the forward process; the inverse process recovers the latent parameters from the measured values. Because the invertible neural network model is bijective, a high-resolution image can be recovered with high accuracy after downscaling.
The image scaling process is illustrated in Fig. 2 and comprises modules M1, M2, and M3, where the structure of M1 is shown in Fig. 3, M2 is a convolutional feature-extraction network, and M3 consists of P flow-steps, each containing an activation normalization layer (ActNorm), a 1×1 convolutional layer (1×1 conv), and an affine coupling layer; y denotes the reduced-resolution image and a denotes the intermediate feature layer.
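As an illustrative sketch (not the patent's trained network), the affine coupling layer that makes each flow-step invertible can be written as a pair of exactly inverse transforms; the scale s and shift t, which in practice would come from a small sub-network conditioned on one half of the input, are fixed hypothetical arrays here:

```python
import numpy as np

# One affine coupling step: the input is split into halves (x1, x2);
# x2 is rescaled and shifted by quantities computed from x1, so the
# transform is exactly invertible.

def coupling_forward(x1, x2, s, t):
    y2 = x2 * np.exp(s) + t   # affine transform of the second half
    return x1, y2             # first half passes through unchanged

def coupling_inverse(y1, y2, s, t):
    x2 = (y2 - t) * np.exp(-s)  # undo the affine transform exactly
    return y1, x2

x1 = np.array([0.5, -1.0])
x2 = np.array([2.0, 3.0])
s = np.array([0.1, -0.2])   # hypothetical scale
t = np.array([0.3, 0.0])    # hypothetical shift

y1, y2 = coupling_forward(x1, x2, s, t)
r1, r2 = coupling_inverse(y1, y2, s, t)
assert np.allclose(r1, x1) and np.allclose(r2, x2)
```

Because every layer of M3 is invertible in this way, the server-side scaling module can reconstruct the high-resolution input from the low-resolution image and the latent variables.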
In one embodiment, the loss function for training the invertible neural network is set to:
L = λ₁·L_pix(y*, y) + λ₂·L_pix(x, x_{τ-1}) + λ₃·R(z)    (Formula 1)
where x is the original-resolution input, y is the low-resolution output, z is the latent-variable output, x_{τ-1} is the high-resolution image restored from y and z, and y* is the low-resolution image obtained from x by bicubic interpolation; L_pix(y*, y) is the pixel loss between y* and y, L_pix(x, x_{τ-1}) is the pixel loss between x and x_{τ-1}, R(z) is the regularization of the latent variable z, and λ₁, λ₂, λ₃ are the weights of the corresponding terms.
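A minimal sketch of the training loss described above, under the assumption that both pixel losses are L1 and the latent regularization is squared L2 (the source text does not fix the exact norms), follows:

```python
import numpy as np

# Hypothetical weighting of the three terms: low-resolution guidance,
# high-resolution reconstruction, and latent regularization.

def rescaling_loss(x, x_restored, y, y_bicubic, z,
                   lam1=1.0, lam2=1.0, lam3=0.1):
    l_lr = np.abs(y_bicubic - y).mean()    # pixel loss between y* and y
    l_hr = np.abs(x - x_restored).mean()   # pixel loss between x and x_{tau-1}
    r_z = np.mean(z ** 2)                  # regularization of latent z
    return lam1 * l_lr + lam2 * l_hr + lam3 * r_z

loss = rescaling_loss(np.zeros((2, 2)), np.ones((2, 2)),
                      np.zeros((1, 1)), np.ones((1, 1)), np.ones(4))
assert loss > 0.0
```

In training, x_restored would be produced by running the inverse flow on (y, z), and the three weights balance fidelity of the low-resolution output against exact recoverability.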
In this step, the image scaling module restores the image to its original size.
Step S130: perform super-resolution reconstruction on the restored image to obtain an ultra-high-resolution image.
For example, the restored output image is super-resolved by a factor of 16 to the high-resolution size using a super-resolution diffusion model; a denoising diffusion probabilistic model performs the super-resolution through a stochastic iterative denoising process.
In one embodiment, image super-resolution reconstruction uses the super-resolution model SR3 (Image Super-Resolution), also called a conditional diffusion probabilistic denoising model, which learns to transform a standard normal distribution into the empirical data distribution through a series of refinement steps. The super-resolution network structure, shown in Fig. 4, adopts a U-Net architecture trained with a denoising objective to iteratively remove noise of various levels from the output.
The conditional diffusion probabilistic denoising model generates the target image y₀ in T refinement steps. The model starts from a pure-noise image y_T ~ N(0, I) and, according to the learned conditional transition distribution p_θ(y_{t-1} | y_t, x), iterates through (y_{T-1}, y_{T-2}, ..., y₀) so that y₀ ~ p(y | x).
Still referring to Fig. 4, and taking a low-resolution image of size 8×8 as an example, to condition the model on the input x, the low-resolution image is upsampled to the target resolution by a deconvolution computation, and the result is concatenated with y_t.
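The conditioning step can be sketched as follows; nearest-neighbour repetition stands in here for the learned deconvolution, and the shapes match the 8×8 → 128×128 example:

```python
import numpy as np

# Upsample the low-resolution input to the target size and concatenate
# it channel-wise with the current noisy estimate y_t, giving the
# 6-channel conditioning input of the denoising U-Net.

def condition(x_lr, y_t, factor=16):
    # (3, 8, 8) -> (3, 128, 128) by plain repetition (stand-in upsampler)
    x_up = x_lr.repeat(factor, axis=1).repeat(factor, axis=2)
    return np.concatenate([x_up, y_t], axis=0)   # -> (6, 128, 128)

x_lr = np.zeros((3, 8, 8))
y_t = np.zeros((3, 128, 128))
assert condition(x_lr, y_t).shape == (6, 128, 128)
```

The resulting 6-channel tensor is what the network consumes at every refinement step, matching the (256, 6, 128, 128) batch shape used later in the training example.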
The distribution of the intermediate images in the inference chain is defined by a forward diffusion process that gradually adds Gaussian noise to the signal via a fixed Markov chain denoted q(y_t | y_{t-1}). The goal of the model is to reverse the Gaussian diffusion process by iteratively recovering the signal from the noise through a reverse Markov chain conditioned on x (the low-resolution image). The reverse chain is learned with a denoising model f_θ that takes the source image and a noisy target image as input and estimates the noise. The training objective function is set, for example, to:
E_{(x,y)} E_{γ,∈} ‖ f_θ(x, ỹ, γ) − ∈ ‖_p^p    (Formula 2)
where ∈ ~ N(0, I); x denotes the low-resolution image and y the high-resolution image, with (x, y) sampled from the training dataset; y₀ denotes the original high-resolution image; ỹ = √γ·y₀ + √(1−γ)·∈ denotes the noisy image; γ denotes the noise scale, drawn from the distribution p(γ), i.e., γ ~ p(γ); p ∈ {1, 2}, where p = 1 gives the L₁ loss and p = 2 the squared L₂ loss; T denotes the total number of diffusion steps, t the diffusion-step index, and f_θ the conditional diffusion probabilistic denoising model.
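A sketch of one evaluation of the training objective above, with a hypothetical stand-in for the noise predictor f_θ (a real model would be the trained U-Net):

```python
import numpy as np

def f_theta(x, y_tilde, gamma):
    # hypothetical noise predictor; returns a zero estimate for the sketch
    return np.zeros_like(y_tilde)

def sr3_loss(x, y0, gamma, eps, p=1):
    # noisy target: y_tilde = sqrt(gamma)*y0 + sqrt(1-gamma)*eps
    y_tilde = np.sqrt(gamma) * y0 + np.sqrt(1.0 - gamma) * eps
    # penalize the gap between predicted and true noise (L1 when p=1)
    return np.mean(np.abs(f_theta(x, y_tilde, gamma) - eps) ** p)

eps = np.random.default_rng(0).standard_normal((3, 8, 8))
loss = sr3_loss(np.zeros((3, 8, 8)), np.ones((3, 8, 8)), 0.5, eps)
assert loss >= 0.0
```

In training, γ is sampled from p(γ) and ∈ afresh at every step, and the loss gradient updates the network weights.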
Each iteration of the iterative refinement under the model takes the following form:
y_{t-1} = (1/√α_t) · ( y_t − ((1−α_t)/√(1−γ_t)) · f_θ(x, y_t, γ_t) ) + √(1−α_t)·∈_t    (Formula 3)
where ∈_t ~ N(0, I); α_t is a hyperparameter with value range 0 < α_t < 1 that determines the variance of the noise added at each iteration; and γ_t = ∏_{i=1}^{t} α_i.
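One refinement iteration can be sketched as below; γ_t is taken as the running product of the α_i, an assumption consistent with the stated range 0 < α_t < 1, and f_θ is a stand-in for the trained network:

```python
import numpy as np

def refine_step(x, y_t, alpha_t, gamma_t, f_theta, rng):
    eps_t = rng.standard_normal(y_t.shape)          # eps_t ~ N(0, I)
    # remove the estimated noise, rescaled by the step's schedule values
    mean = (y_t - (1.0 - alpha_t) / np.sqrt(1.0 - gamma_t)
            * f_theta(x, y_t, gamma_t)) / np.sqrt(alpha_t)
    # re-inject a small amount of noise with variance (1 - alpha_t)
    return mean + np.sqrt(1.0 - alpha_t) * eps_t

rng = np.random.default_rng(0)
x = np.zeros((3, 16, 16))
y_t = rng.standard_normal((3, 16, 16))
y_prev = refine_step(x, y_t, 0.99, 0.9,
                     lambda x, y, g: np.zeros_like(y), rng)
assert y_prev.shape == y_t.shape
```

Each step moves y_t a little closer to a clean sample, with α_t controlling how much fresh noise is re-injected.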
Step S140: detect the category and position of targets based on the ultra-high-resolution image.
In this step, the ultra-high-resolution image is fed into the target detector, which outputs the category and coordinate information of each target.
For example, as shown in Fig. 5, a feature pyramid is used to achieve multi-scale target detection. The feature pyramid is a basic component of multi-scale object detection. High-level features contain rich semantic information but, because of their low resolution, can hardly preserve object positions accurately; low-level features, in contrast, carry less semantic information but, thanks to their high resolution, locate objects precisely. Low-level and high-level features are therefore fused to build a feature pyramid, and every feature map is fed into a prediction head, yielding a target detection system that is accurate in both recognition and localization and that outputs target information, for example the category and position of each target.
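The top-down fusion described above can be sketched as follows; the lateral 1×1 convolutions that equalize channel counts in a full feature pyramid are omitted, so all levels are assumed to share the same channel count:

```python
import numpy as np

def upsample2x(f):
    # nearest-neighbour doubling of the spatial dimensions
    return f.repeat(2, axis=-2).repeat(2, axis=-1)

def build_pyramid(features):
    # features ordered low level -> high level (large map -> small map);
    # each high-level map is upsampled and added to the next level down,
    # so every output carries both semantics and localization detail.
    outputs = [features[-1]]
    for f in reversed(features[:-1]):
        outputs.append(f + upsample2x(outputs[-1]))
    return list(reversed(outputs))   # one fused map per input level

pyr = build_pyramid([np.ones((8, 32, 32)),
                     np.ones((8, 16, 16)),
                     np.ones((8, 8, 8))])
assert [p.shape for p in pyr] == [(8, 32, 32), (8, 16, 16), (8, 8, 8)]
```

Each fused map would then be passed to its own prediction head for classification and box regression.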
Preferably, since even simple upsampling can substantially improve target detection performance, the target detection module interpolates the ultra-low-resolution image, concatenates it with the high-resolution image, feeds both into the feature-extraction module, and ranks the resulting detections by weight.
Step S150: fuse the target information with the device state information and convert it into perceivable information.
In this step, a post-processing module fuses the target information with the device state information and converts it into information that a blind user can perceive.
For a further understanding of the present invention, an embodiment of the super-resolution reconstruction process is described below, taking 8×8 → 128×128 as an example.
1) Building the training set
Images whose short side is smaller than 128 pixels are discarded; the remaining images are center-cropped to 128×128 as high-resolution images y₀. Each high-resolution image is downsampled by a factor of 16 to 8×8 using a bicubic interpolation algorithm to form the low-resolution image x; all pairs of high- and low-resolution images constitute the training set.
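The training-pair construction can be sketched as follows; box-average downsampling stands in for the bicubic interpolation named in the text:

```python
import numpy as np

def center_crop(img, size=128):
    h, w = img.shape[:2]
    if min(h, w) < size:
        return None              # short side too small: skip the image
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def downsample16(img):
    # average over 16x16 blocks (stand-in for bicubic interpolation)
    h, w = img.shape[:2]
    return img.reshape(h // 16, 16, w // 16, 16, -1).mean(axis=(1, 3))

img = np.ones((200, 300, 3))
hi = center_crop(img)            # high-resolution sample y0
lo = downsample16(hi)            # low-resolution sample x
assert hi.shape[:2] == (128, 128) and lo.shape[:2] == (8, 8)
```

Iterating this over an image collection produces the (x, y₀) pairs used for training.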
2) Training the super-resolution diffusion model
For example, the experimental settings are as follows:
batch size: 256;
optimizer: Adam;
learning rate: 1e-4;
number of iterations: 2000 for training, 100 for inference; α₀ = 0.9, α_T = -19.
During training, the low-resolution images (256, 3, 8, 8) are upsampled by a factor of 16 to (256, 3, 128, 128) by a deconvolution computation and concatenated with the noise images into (256, 6, 128, 128) as the network input. The network loss is obtained from Formula 2, and the gradients are then computed and back-propagated to update the network weights.
3) Inference with the trained model
Specifically, the inference process is: concatenate the interpolated low-resolution image x with y_T to obtain y_{T-1} from Formula 3; likewise obtain y_{T-2} from x and y_{T-1}; after T iterations, y₀ is obtained.
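The inference loop just described can be sketched as below, with a stand-in noise predictor; x_up denotes the low-resolution image already upsampled to the target size, and γ_t is again assumed to be the running product of the α_i:

```python
import numpy as np

def super_resolve(x_up, alphas, f_theta, rng):
    gammas = np.cumprod(alphas)          # gamma_t (assumed schedule)
    y = rng.standard_normal(x_up.shape)  # y_T ~ N(0, I)
    for t in range(len(alphas) - 1, -1, -1):
        eps_hat = f_theta(x_up, y, gammas[t])
        # one refinement step per Formula 3
        y = (y - (1.0 - alphas[t]) / np.sqrt(1.0 - gammas[t])
             * eps_hat) / np.sqrt(alphas[t])
        if t > 0:                        # no fresh noise at the last step
            y = y + np.sqrt(1.0 - alphas[t]) * rng.standard_normal(y.shape)
    return y

rng = np.random.default_rng(0)
x_up = np.zeros((3, 16, 16))
y0_hat = super_resolve(x_up, np.full(10, 0.99),
                       lambda x, y, g: np.zeros_like(y), rng)
assert y0_hat.shape == x_up.shape
```

With a trained f_θ, y0_hat would be the super-resolved estimate of the original high-resolution image.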
Further, the interpolated low-resolution image x and y₀ are concatenated and fed into the target detector, which yields two sets of target positions and categories; after weighted ranking, a non-maximum suppression operation is applied to obtain the final result.
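The final fusion step can be sketched as a plain score-sorted non-maximum suppression over the merged detections; the box format (x1, y1, x2, y2, score) and the threshold are illustrative:

```python
def iou(a, b):
    # intersection-over-union of two axis-aligned boxes
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def nms(dets, thresh=0.5):
    # keep the highest-scoring box of every overlapping group
    dets = sorted(dets, key=lambda d: d[4], reverse=True)
    keep = []
    for d in dets:
        if all(iou(d, k) < thresh for k in keep):
            keep.append(d)
    return keep

dets = [(0, 0, 10, 10, 0.9), (1, 1, 10, 10, 0.8), (20, 20, 30, 30, 0.7)]
kept = nms(dets)
assert len(kept) == 2   # the two overlapping boxes collapse to one
```

In the method above, the two detection sets (from the interpolated image and from y₀) would first be merged and score-weighted before this suppression pass.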
The present invention applies a diffusion probabilistic model to super-resolve low-resolution images, realizing a 16× conversion from ultra-low-resolution images (e.g., as small as 8×8 pixels) to high-resolution images (e.g., 128×128 pixels); the target detection module then detects on the high-resolution image. This addresses the poor robustness and low accuracy of target detection in the low-resolution scenarios faced by blind-guidance technology, and reduces device power consumption.
In summary, the present invention designs a super-resolution-based small-image multi-target detection method that solves the problem of degraded obstacle detection in ultra-low-resolution scenarios in blind-guidance technology. Image scaling technology is used to scale the original image down to a low-resolution image for low-cost transmission and then restore the low-resolution image to a high-quality original image; image super-resolution based on a diffusion probabilistic model enables target detection on low-resolution images of a blind user's daily scenes during guidance, providing a solution for existing blind-guidance technology; and low-resolution and high-resolution image information are used jointly to improve detection accuracy. In short, the present invention takes lower-resolution images as the original input, so that the guidance device can accommodate low-resolution cameras, while image scaling reduces the amount of data transferred, lowering power consumption and device volume, so that the guidance device can operate for long periods and the burden on the user is reduced.
The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement various aspects of the present invention.
The computer-readable storage medium may be a tangible device that can retain and store instructions for use by an instruction-execution device. The computer-readable storage medium may be, for example but not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punch card or a raised structure in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (e.g., light pulses through a fiber-optic cable), or an electrical signal transmitted through a wire.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to the respective computing/processing devices, or to an external computer or external storage device via a network such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may comprise copper transmission cables, optical-fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk, C++, and Python, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In scenarios involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA) may be personalized with state information of the computer-readable program instructions, and the electronic circuit may execute the computer-readable program instructions, thereby implementing various aspects of the present invention.
Aspects of the present invention are described herein with reference to flowcharts and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data-processing apparatus to produce a machine, such that the instructions, when executed via the processor of the computer or other programmable data-processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium; the instructions cause a computer, a programmable data-processing apparatus, and/or other devices to function in a particular manner, so that the computer-readable medium storing the instructions comprises an article of manufacture including instructions that implement aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions may also be loaded onto a computer, another programmable data-processing apparatus, or another device, causing a series of operational steps to be performed on the computer, other programmable apparatus, or other device to produce a computer-implemented process, such that the instructions executed on the computer, other programmable apparatus, or other device implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of instructions that comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures; for example, two successive blocks may in fact be executed substantially concurrently, or sometimes in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or acts, or by a combination of dedicated hardware and computer instructions. It is well known to those skilled in the art that implementation in hardware, implementation in software, and implementation by a combination of software and hardware are all equivalent.
The embodiments of the present invention have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or technical improvements over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims (10)

  1. A super-resolution-based small-image multi-target detection method, comprising the following steps:
    Step S1: acquiring a first-resolution image of an original scene;
    Step S2: converting the first-resolution image into a second-resolution image with an invertible neural network model, transmitting it, and then restoring it to the first-resolution image, wherein the resolution of the second-resolution image is lower than that of the first-resolution image;
    Step S3: inputting the restored first-resolution image into a trained super-resolution diffusion model, performing super-resolution reconstruction through a stochastic iterative denoising process, and outputting an ultra-high-resolution image;
    Step S4: performing target detection on the ultra-high-resolution image to obtain target recognition information.
  2. The method according to claim 1, wherein the loss function for training the invertible neural network model is set to:
    L = λ₁·L_pix(y*, y) + λ₂·L_pix(x, x_{τ-1}) + λ₃·R(z)
    where x is the first-resolution image input, y is the second-resolution image output, z is the latent-variable output, x_{τ-1} is the first-resolution image restored from y and z, and y* is the second-resolution image obtained from x by bicubic interpolation; L_pix(y*, y) is the pixel loss between y* and y, L_pix(x, x_{τ-1}) is the pixel loss between x and x_{τ-1}, R(z) is the regularization of the latent variable z, and λ₁, λ₂, λ₃ are the weights of the corresponding terms.
  3. The method according to claim 1, wherein the super-resolution diffusion model adopts a U-Net architecture and learns to transform a standard normal distribution into an empirical data distribution through T refinement steps.
  4. The method according to claim 3, wherein, in the T refinement steps, the super-resolution diffusion model starts from a pure-noise image and, according to the learned conditional transition distribution, iterates continuously so that the generated target image conforms to a preset probability distribution.
  5. The method according to claim 1, wherein the training objective function of the super-resolution diffusion model is set to:
    E_{(x,y)} E_{γ,∈} ‖ f_θ(x, ỹ, γ) − ∈ ‖_p^p
    where ∈ ~ N(0, I); x denotes the low-resolution image and y the high-resolution image, with (x, y) sampled from the training dataset; y₀ denotes the original high-resolution image; ỹ = √γ·y₀ + √(1−γ)·∈ denotes the noisy image; γ denotes the noise scale with γ ~ p(γ); p ∈ {1, 2}, where p = 1 gives the L₁ loss and p = 2 the squared L₂ loss; T denotes the total number of diffusion steps, t the diffusion-step index, and f_θ the super-resolution diffusion model; each iteration under the model takes the following form:
    y_{t-1} = (1/√α_t) · ( y_t − ((1−α_t)/√(1−γ_t)) · f_θ(x, y_t, γ_t) ) + √(1−α_t)·∈_t
    where ∈_t ~ N(0, I), α_t is a hyperparameter with value range 0 < α_t < 1, and γ_t = ∏_{i=1}^{t} α_i.
  6. The method according to claim 1, wherein, in step S4, low-level features and high-level features are fused to build a feature pyramid, and each feature map is input into a prediction head to obtain the category and position information of the targets.
  7. The method according to claim 1, wherein the training set of the super-resolution diffusion model is constructed according to the following steps:
    cropping the collected pictures to the target high-resolution size to obtain high-resolution pictures;
    downsampling the high-resolution pictures to the target low-resolution size with a bicubic interpolation algorithm to obtain low-resolution pictures;
    all pairs of high- and low-resolution images constituting the training set.
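The pair-construction steps above can be sketched as follows. For simplicity this sketch uses average pooling in place of the bicubic interpolation named in the claim (a real pipeline would use a bicubic resampler, e.g. Pillow's `Image.resize` with its bicubic filter); all sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def center_crop(img, size):
    # Crop the collected picture to the target high-resolution size.
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def downsample(img, factor):
    # Average-pooling downsample standing in for bicubic interpolation
    # (illustration only; the claim specifies bicubic).
    h, w = img.shape[:2]
    return img[:h - h % factor, :w - w % factor] \
        .reshape(h // factor, factor, w // factor, factor) \
        .mean(axis=(1, 3))

hr_size, scale = 64, 4
collected = [rng.random((80, 90)) for _ in range(3)]  # raw collected pictures
pairs = []
for pic in collected:
    hr = center_crop(pic, hr_size)   # high-resolution picture
    lr = downsample(hr, scale)       # low-resolution picture
    pairs.append((lr, hr))           # one (LR, HR) training pair

print(len(pairs), pairs[0][0].shape, pairs[0][1].shape)
```

All (LR, HR) pairs together then form the training set of the super-resolution model.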
  8. The method according to claim 1, wherein a camera in a head-mounted device is used to acquire the first-resolution image of the original scene, and the obtained target recognition information is fused with device state information and converted into information that the user can perceive.
  9. A computer-readable storage medium on which a computer program is stored, wherein, when the program is executed by a processor, the steps of the method according to any one of claims 1 to 8 are implemented.
  10. A computer device comprising a memory and a processor, wherein a computer program capable of running on the processor is stored in the memory, and wherein the processor, when executing the program, implements the steps of the method according to any one of claims 1 to 8.
PCT/CN2021/138098 2021-10-14 2021-12-14 Small image multi-object detection method based on super-resolution WO2023060746A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111198028.7A CN113920013B (en) 2021-10-14 2021-10-14 Super-resolution-based small image multi-target detection method
CN202111198028.7 2021-10-14

Publications (1)

Publication Number Publication Date
WO2023060746A1 true WO2023060746A1 (en) 2023-04-20

Family

ID=79240553

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/138098 WO2023060746A1 (en) 2021-10-14 2021-12-14 Small image multi-object detection method based on super-resolution

Country Status (2)

Country Link
CN (1) CN113920013B (en)
WO (1) WO2023060746A1 (en)


Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114820398B (en) * 2022-07-01 2022-11-04 北京汉仪创新科技股份有限公司 Image font replacing method, system, equipment and medium based on diffusion model
CN115471398B (en) * 2022-08-31 2023-08-15 北京科技大学 Image super-resolution method, system, terminal equipment and storage medium
CN117078510B (en) * 2022-11-16 2024-04-30 电子科技大学 Single image super-resolution reconstruction method of potential features
CN116012296B (en) * 2022-12-01 2023-10-24 浙江大学 Prefabricated part detection method based on super-resolution and semi-supervised learning
CN116469047A (en) * 2023-03-20 2023-07-21 南通锡鼎智能科技有限公司 Small target detection method and detection device for laboratory teaching
CN117746171B (en) * 2024-02-20 2024-04-23 成都信息工程大学 Unsupervised weather downscaling method based on dual learning and auxiliary information

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107133916A (en) * 2017-04-21 2017-09-05 西安科技大学 Image-scaling method
CN107492070A (en) * 2017-07-10 2017-12-19 North China Electric Power University Single-image super-resolution computation method based on a dual-channel convolutional neural network
CN111784624A (en) * 2019-04-02 2020-10-16 北京沃东天骏信息技术有限公司 Target detection method, device, equipment and computer readable storage medium
US20210136394A1 (en) * 2019-11-05 2021-05-06 Canon Kabushiki Kaisha Encoding apparatus and encoding method, and decoding apparatus and decoding method
CN113014927A (en) * 2021-03-02 2021-06-22 三星(中国)半导体有限公司 Image compression method and image compression device
CN113139896A (en) * 2020-01-17 2021-07-20 波音公司 Target detection system and method based on super-resolution reconstruction
CN113298718A (en) * 2021-06-22 2021-08-24 云南大学 Single image super-resolution reconstruction method and system

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103136734B (en) * 2013-02-27 2016-01-13 Beijing University of Technology Method for suppressing the edge halo effect in projection-onto-convex-sets super-resolution image reconstruction
CN106981046B (en) * 2017-03-21 2019-10-11 四川大学 Single image super resolution ratio reconstruction method based on multi-gradient constrained regression
US11232541B2 (en) * 2018-10-08 2022-01-25 Rensselaer Polytechnic Institute CT super-resolution GAN constrained by the identical, residual and cycle learning ensemble (GAN-circle)
CN110428378B (en) * 2019-07-26 2022-02-08 北京小米移动软件有限公司 Image processing method, device and storage medium
CN111062872B (en) * 2019-12-17 2021-02-05 暨南大学 Image super-resolution reconstruction method and system based on edge detection
WO2021121108A1 (en) * 2019-12-20 2021-06-24 北京金山云网络技术有限公司 Image super-resolution and model training method and apparatus, electronic device, and medium
CN111369440B (en) * 2020-03-03 2024-01-30 网易(杭州)网络有限公司 Model training and image super-resolution processing method, device, terminal and storage medium
CN113496465A (en) * 2020-03-20 2021-10-12 微软技术许可有限责任公司 Image scaling
CN111353940B (en) * 2020-03-31 2021-04-02 成都信息工程大学 Image super-resolution reconstruction method based on deep learning iterative up-down sampling
CN113177882B (en) * 2021-04-29 2022-08-05 浙江大学 Single-frame image super-resolution processing method based on diffusion model


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHITWAN SAHARIA; JONATHAN HO; WILLIAM CHAN; TIM SALIMANS; DAVID J. FLEET; MOHAMMAD NOROUZI: "Image Super-Resolution via Iterative Refinement", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 30 June 2021 (2021-06-30), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081980663 *
JINGYUN LIANG; ANDREAS LUGMAYR; KAI ZHANG; MARTIN DANELLJAN; LUC VAN GOOL; RADU TIMOFTE: "Hierarchical Conditional Flow: A Unified Framework for Image Super-Resolution and Image Rescaling", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 11 August 2021 (2021-08-11), 201 Olin Library Cornell University Ithaca, NY 14853, XP091032867 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116777906A (en) * 2023-08-17 2023-09-19 常州微亿智造科技有限公司 Abnormality detection method and abnormality detection device in industrial detection
CN116777906B (en) * 2023-08-17 2023-11-14 常州微亿智造科技有限公司 Abnormality detection method and abnormality detection device in industrial detection
CN117409192A (en) * 2023-12-14 2024-01-16 武汉大学 Data enhancement-based infrared small target detection method and device
CN117409192B (en) * 2023-12-14 2024-03-08 武汉大学 Data enhancement-based infrared small target detection method and device
CN117830800A (en) * 2024-03-04 2024-04-05 广州市仪美医用家具科技股份有限公司 Clothing detection and recovery method, system, medium and equipment based on YOLO algorithm

Also Published As

Publication number Publication date
CN113920013A (en) 2022-01-11
CN113920013B (en) 2023-06-16

Similar Documents

Publication Publication Date Title
WO2023060746A1 (en) Small image multi-object detection method based on super-resolution
US11763466B2 (en) Determining structure and motion in images using neural networks
Shivakumar et al. Dfusenet: Deep fusion of rgb and sparse depth information for image guided dense depth completion
US11734847B2 (en) Image depth prediction neural networks
US10671855B2 (en) Video object segmentation by reference-guided mask propagation
US11481869B2 (en) Cross-domain image translation
CN113066017B (en) Image enhancement method, model training method and equipment
Zhou et al. Scale adaptive image cropping for UAV object detection
WO2021018106A1 (en) Pedestrian detection method, apparatus, computer-readable storage medium and chip
KR20220005432A (en) Scene representation using image processing
Liu et al. Effective image super resolution via hierarchical convolutional neural network
KR20220148274A (en) Self-supervised representation learning using bootstrapped latent representations
CN111832393A (en) Video target detection method and device based on deep learning
JP2024026745A (en) Using imager with on-purpose controlled distortion for inference or training of artificial intelligence neural network
Zhang et al. Unsupervised depth estimation from monocular videos with hybrid geometric-refined loss and contextual attention
Huang et al. Learning optical flow with R-CNN for visual odometry
US20230053618A1 (en) Recurrent unit for generating or processing a sequence of images
WO2023123873A1 (en) Dense optical flow calculation method employing attention mechanism
Kato et al. Visual language modeling on cnn image representations
CN117409375B (en) Dual-attention-guided crowd counting method, apparatus and computer storage medium
Pal et al. MAML-SR: Self-adaptive super-resolution networks via multi-scale optimized attention-aware meta-learning
US20220171959A1 (en) Method and apparatus with image processing
US20230394699A1 (en) Method of estimating a three-dimensional position of an object
Dronova et al. FlyNeRF: NeRF-Based Aerial Mapping for High-Quality 3D Scene Reconstruction
Bi et al. Image deblurring method based on feature fusion SRN

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21960471

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE