CN114937153B - Visual characteristic processing system and method based on neural network in weak texture environment - Google Patents

Visual characteristic processing system and method based on neural network in weak texture environment Download PDF

Info

Publication number
CN114937153B
Authority
CN
China
Prior art keywords
original image
branch
detector
descriptor
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210663043.2A
Other languages
Chinese (zh)
Other versions
CN114937153A (en
Inventor
方浩
胡家瑞
王奥博
陈杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202210663043.2A priority Critical patent/CN114937153B/en
Publication of CN114937153A publication Critical patent/CN114937153A/en
Application granted granted Critical
Publication of CN114937153B publication Critical patent/CN114937153B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Abstract

The invention discloses a neural-network-based visual feature processing system and method for weak texture environments. The processing system comprises a backbone network, a detector branch and a descriptor branch, wherein the detector branch and the descriptor branch are symmetrical sub-branches of a twin (Siamese) network. The backbone network performs convolution processing on an original image and outputs a deep feature map of the original image. The output of a first spatial module is fused with the output of the first convolution layer in the detector branch, and the detector branch outputs a corner probability map that represents the probability that each point in the original image is a corner. The output of a second spatial module is fused with the output of the first convolution layer in the descriptor branch, and the descriptor branch outputs a descriptor map that represents the descriptor of each point in the original image.

Description

Visual characteristic processing system and method based on neural network in weak texture environment
Technical Field
The invention relates to the field of computer vision, in particular to a visual characteristic processing system and method based on a neural network in a weak texture environment.
Background
In recent years, artificial intelligence has developed rapidly and a global pattern of automation is gradually taking shape. Computer vision, as one of the core perception technologies, has created considerable social, economic and academic value, and the depth and breadth of its social applications keep growing: industries such as security, medicine, agriculture and forestry, and manufacturing are gradually entering an era of visual intelligence, making computer vision an indispensable enabling technology for intelligent innovation. Feature information is crucial in the process of visual enabling and is the key cue by which a computing system understands and recognizes images. Researchers have proposed rich feature design schemes based on graphics; feature information that is distinctive and repeatable provides good operating primitives for visual tasks such as image retrieval, image stitching, VSLAM and three-dimensional reconstruction. Among these, VSLAM endows agents such as unmanned aerial vehicles and unmanned ground vehicles with self-localization and environment perception capabilities and is an important technical driver of intelligent unmanned systems. However, traditional geometry-based image features depend excessively on image quality and are inherently sensitive to changes in the imaging environment; when facing the common weak-texture scenes shown in fig. 1, feature quality degrades, the feature algorithm fails, and tasks such as VSLAM collapse. Existing feature processing techniques still have obvious deficiencies in resisting environmental interference, coping with device noise and adapting to motion changes, and technological innovation and product incubation urgently require robust and accurate feature extraction and description algorithms. For visual localization and mapping tasks in weak texture environments, the existing solutions are as follows:
scheme 1: yiK M, etc., LIFT Learned Invariant Feature Transform [ J ]. The proposal utilizes a motion structure recovery method to construct a supervision signal to make up for the data deficiency, and realizes the interactive connection and End-to-End synchronous learning of three subtask networks (a detector, a direction estimator and a descriptor) under a unified framework. However, the sub-networks in the LIFT model fail to form a calculation sharing relation, so that the LIFT features are difficult to meet the real-time application requirements.
Scheme 2: detone D et al, superPoint: self-Supervised Interest Point Detection and Description [ J ]. According to the scheme, a twin neural network design is adopted, calculation sharing between a detection network and a description network is basically realized, the real-time performance of the method is excellent, a self-labeling method is adopted in the SuperPoint scheme to obtain training samples, a labeling device and homography transformation are utilized to label an original image to finish false true values, and the SuperPoint network can output denser and more repeatable image features due to the homography transformation mechanism. However, this work uses implicit methods to model spatial characteristics, which are not ideal in effect.
Scheme 3: dusmanu M et al, A trainable cnn for joint description and detection of local features [ C ]. The scheme provides a concept of synchronous detection and description, breaks through the traditional mode of 'first detection and then description' in the time dimension, and the network output simultaneously comprises characteristic position score and descriptor information, so that the D2-Net truly realizes complete fusion of the detector and the descriptor, and achieves good effects in the network efficiency level. However, D2-Net performs poorly in terms of feature accuracy.
Disclosure of Invention
In view of the above, the present invention provides a visual feature processing system and method based on a neural network for weak texture environments, which can solve the technical problem of reducing the interference of weak texture scenes with the feature extraction and description process.
To solve the above technical problem, the present invention is implemented as follows.
A neural network-based visual feature processing system in a weak texture environment, comprising:
the system comprises a backbone network, a detector branch and a descriptor branch, wherein the detector branch and the descriptor branch are symmetrical sub-branches of a twin network;
the backbone network is used for receiving an input original image, carrying out convolution processing on the original image and outputting a deep feature map of the original image; the backbone network comprises a plurality of cascaded convolution layers, wherein shallow feature maps obtained after shallow convolution of the backbone network are simultaneously input into a first space module and a second space module; the first space module and the second space module are respectively used for space invariance reduction;
the detector branches comprise a plurality of cascade convolution layers, the output of the first space module is fused with the output of the first convolution layer in the detector branches, and the detector branches output corner probability diagrams which are used for representing the probability that each point in the original image is a corner;
the descriptor branch comprises a plurality of cascaded convolution layers, the output of the second spatial module is fused with the output of the first convolution layer in the descriptor branch, and the descriptor branch outputs a description subgraph which is used for representing the description sub-morphology of each point in the original image.
Preferably, the detector branch is configured to receive the deep feature map of the original image output by the backbone network; the detector branch includes a plurality of cascaded convolution layers, the output of the first spatial module is fused with the output of the first convolution layer in the detector branch, and the detector branch outputs a corner probability map used to characterize the probability that each point in the original image is a corner;
the descriptor branch is used for receiving the deep feature map of the original image output by the backbone network; the descriptor branch comprises a plurality of cascaded convolution layers, the output of the second spatial module is fused with the output of the first convolution layer in the descriptor branch, and the descriptor branch outputs a descriptor map used to characterize the descriptor of each point in the original image.
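As a minimal sketch of this twin data flow, assuming PyTorch and hypothetical sub-module names (a backbone that returns both the deep feature map and the shallow feature map fed to the spatial modules, plus separately defined spatial modules and heads; the patent publishes no reference code), the forward pass could be organised as follows:

```python
import torch.nn as nn

class TwinFeatureNet(nn.Module):
    """Illustrative composition of the processing system; sub-modules are assumptions."""
    def __init__(self, backbone, spatial_module_det, spatial_module_desc,
                 detector_head, descriptor_head):
        super().__init__()
        self.backbone = backbone              # cascaded convolution layers
        self.st_det = spatial_module_det      # first spatial module
        self.st_desc = spatial_module_desc    # second spatial module
        self.detector = detector_head         # produces the 65-channel corner probability map
        self.descriptor = descriptor_head     # produces the 256-channel descriptor map

    def forward(self, image):
        deep, shallow = self.backbone(image)            # deep feature map + shallow feature map
        st_det = self.st_det(shallow)                   # spatial conversion feature maps
        st_desc = self.st_desc(shallow)
        corner_prob = self.detector(deep, st_det)       # fusion happens inside the detector branch
        descriptor_map = self.descriptor(deep, st_desc) # fusion happens inside the descriptor branch
        return corner_prob, descriptor_map
```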
Preferably, during training the detector branch uses an information-quantity loss function and divides the original image with an 8×8 neighborhood as the basic unit, obtaining a basic-unit grid. Assume the grid contains H_C × W_C basic units in total, each basic unit is denoted x_hw, and the ground-truth label set of the real scene data set is denoted Y. The loss function of the detector branch is:

L_p(X, Y) = \frac{1}{H_C W_C} \sum_{h=1}^{H_C} \sum_{w=1}^{W_C} l_p(x_{hw}; y_{hw})

l_p(x_{hw}; y) = -\log\left( \frac{\exp(x_{hwy})}{\sum_{k=1}^{65} \exp(x_{hwk})} \right)

wherein H_C is the total number of rows of the basic-unit grid, W_C is the total number of columns of the basic-unit grid, h is the row index of the basic-unit grid, w is the column index of the basic-unit grid, y is the pixel position of the corner within one basic unit, l_p is the negative logarithm of the normalized network prediction at the corner pixel position, x_{hwy} is the network prediction at the corner pixel position within a basic unit, x_{hwk} is the network prediction at an arbitrary pixel position within a basic unit, and k is the channel index.
Preferably, the descriptor branch uses a hinge loss function during training, of the following specific form:
Descriptor map corresponding to the original image: D; homography transformation: H; descriptor map corresponding to the deformed image obtained by applying the homography transformation to the original image: D'.
Descriptor corresponding to the original image: d_{hw}; descriptor corresponding to the deformed image: d'_{h'w'}.
Center pixel coordinate of an 8×8 neighborhood within the original image: p_{hw}.
Center pixel coordinate of an 8×8 neighborhood within the deformed image: p'_{h'w'}.
The correspondence relation is judged as:

s_{hwh'w'} = \begin{cases} 1, & \text{if } \lVert \widehat{H p_{hw}} - p'_{h'w'} \rVert \le 8 \\ 0, & \text{otherwise} \end{cases}

l_d(d_{hw}, d'_{h'w'}, s) = \lambda_d \cdot s \cdot \max(0, m_p - d_{hw}^T d'_{h'w'}) + (1 - s) \cdot \max(0, d_{hw}^T d'_{h'w'} - m_n)

L_d(D, D') = \frac{1}{(H_C W_C)^2} \sum_{h=1}^{H_C} \sum_{w=1}^{W_C} \sum_{h'=1}^{H_C} \sum_{w'=1}^{W_C} l_d(d_{hw}, d'_{h'w'}, s_{hwh'w'})

wherein L_d is the descriptor loss function, \widehat{H p_{hw}} is the coordinate of the 8×8 neighborhood center pixel of the original image after homography transformation, λ_d is the weight parameter, s is the correspondence judgment parameter, m_p is the positive margin parameter, d_{hw}^T is the transpose of d_{hw}; h is the row index of the basic-unit grid corresponding to the original image, w is the column index of the basic-unit grid corresponding to the original image, h' is the row index of the basic-unit grid corresponding to the deformed image, w' is the column index of the basic-unit grid corresponding to the deformed image, and m_n is the negative margin parameter.
Further, the first space module and the second space module each comprise a plurality of convolution networks, a grid generator, a sampling network and a sampler; the space module receives a shallow feature map in a main network as input, a six-degree-of-freedom affine transformation matrix is obtained through convolution operation of the convolution networks, the obtained six-degree-of-freedom affine transformation matrix is input into a grid generator, the grid generator generates grids to obtain a sampling grid, and the sampler samples pixels of the shallow feature map in the main network according to the sampling grid to obtain a space conversion feature map.
Further, the training of the processing system is five-stage training, wherein in the first stage, data enhancement operation is performed on a training sample data set, and then the detector branches are independently trained by using the training sample data set; a second stage, marking the real scene data set by utilizing the detector branches obtained by the training in the first stage, and obtaining a feature marking data set in the real scene; a third stage, namely completely emptying the weight parameters of the detector branches obtained by training in the first stage, and independently retraining the detector branches by using the feature annotation data set obtained in the second stage; a fourth stage, re-labeling the real scene data set by using the detector branch obtained in the third stage to obtain a secondary labeling data set; and a fifth stage, clearing the weight of the detector branch and the descriptor branch, and utilizing the secondary labeling data set to perform joint training on the detector branch and the descriptor branch.
The invention provides a visual characteristic processing method based on a neural network in a weak texture environment, which comprises the following steps:
step S1: acquiring an original image;
step S2: inputting the original image into the processing system;
step S3: the processing system performs feature detection and description on the original image to obtain corner points and corresponding descriptors of the original image;
step S4: based on the corner points and descriptors of the original image, image stitching, visual positioning and scene recognition can be completed in a weak texture environment.
The beneficial effects are that:
the invention fully plays the advantages of the deep learning method, guides the network to pay attention to the scene area with rich texture information in a data driving mode, and enhances the overall spatial stability and sensitivity of the network by adding a spatial processing module in a targeted manner. In the invention, the space module is connected into the twin part in a layer jump connection mode, and the space adaptation capacity of the network model is expanded on the premise of guaranteeing the authenticity of the deep features of the image.
The method has the following technical effects:
(1) The invention provides a visual feature processing system based on a neural network to reduce the adverse effect of a weak texture scene on visual feature processing work, and the data driving method breaks through the geometric rule constraint suffered by the traditional feature algorithm, so that the real-time performance is ensured, and the image information utilization rate is further improved.
(2) According to the invention, a space converter module is introduced, and the converted space conversion feature map and the original feature map are subjected to cascade superposition, so that the processing method explicitly models the space characteristics of the image, and compared with the implicit modeling method in the prior work, the processing method has more excellent performance.
(3) The invention adopts the self-supervision labeling method to complete training, solves the problems of subjective errors and sample deletion of manual labeling, further improves the data utilization rate, fully develops the potential of a network structure, furthest reduces the damage of scene limitation to a feature network, and has remarkable significance for enhancing the practical application value of the deep learning technology in the problems of feature extraction and description.
(4) According to the invention, through the twin neural network architecture and the self-supervision labeling training strategy, strong constraint brought by geometric rules in the feature extraction process is eliminated, so that the network has excellent robustness, flexibility and scene adaptability, and the external environment interference and algorithm complexity are reduced.
(5) The processing system adopts a twin architecture and a feature processing algorithm that establishes a standard framework integrating feature extraction and description; the spatial module with the structure shown in fig. 3 models spatial characteristics explicitly, ensuring the specificity and spatial quality of the extracted scene features.
Drawings
FIG. 1 is a schematic diagram of a weak texture scene;
FIG. 2 is a schematic diagram of a visual feature processing system architecture based on a neural network in a weak texture environment provided by the present invention;
FIG. 3 is a schematic diagram of a space module architecture according to the present invention;
FIGS. 4 (A) -4 (B) are schematic diagrams of synthetic datasets provided by the present invention;
FIG. 5 is a schematic representation of a real scene data set used in the present invention;
fig. 6 (a) -6 (B) are diagrams of output results of the detector provided by the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and examples.
As shown in fig. 2-3, the present invention proposes a visual feature processing system based on a neural network in a weak texture environment, where the processing system includes:
the system comprises a backbone network, a detector branch and a descriptor branch, wherein the detector branch and the descriptor branch are symmetrical sub-branches of a twin network;
the backbone network is used for receiving an input original image, carrying out convolution processing on the original image and outputting a deep feature map of the original image; the backbone network comprises a plurality of cascaded convolution layers, wherein shallow feature maps obtained after shallow convolution of the backbone network are simultaneously input into a first space module and a second space module; the first space module and the second space module are respectively used for space invariance reduction;
the detector branches comprise a plurality of cascade convolution layers, the output of the first space module is fused with the output of the first convolution layer in the detector branches, and the detector branches output corner probability diagrams which are used for representing the probability that each point in the original image is a corner;
the descriptor branch comprises a plurality of cascaded convolution layers, the output of the second spatial module is fused with the output of the first convolution layer in the descriptor branch, and the descriptor branch outputs a description subgraph which is used for representing the description sub-morphology of each point in the original image.
The shallow convolution means that the image is only processed by partial convolution layers and does not reach the deep degree.
Further, the detector branch is configured to receive the deep feature map of the original image output by the backbone network; the detector branch includes a plurality of cascaded convolution layers, the output of the first spatial module is fused with the output of the first convolution layer in the detector branch, and the detector branch outputs a corner probability map used to characterize the probability that each point in the original image is a corner;
the descriptor branch is used for receiving the deep feature map of the original image output by the backbone network; the descriptor branch comprises a plurality of cascaded convolution layers, the output of the second spatial module is fused with the output of the first convolution layer in the descriptor branch, and the descriptor branch outputs a descriptor map used to characterize the descriptor of each point in the original image.
Image features are the basic operating units of many computer vision tasks and the key information by which a computer interprets image content. By exploiting the strong ability of convolutional neural networks to extract deep image features over a wide receptive field, a deep-learning-based feature processing algorithm can escape geometric rule constraints and weaken external environmental interference. After a scene image is read in, the original image is convolutionally encoded by the multiple convolution layers of the backbone network to complete the deep feature extraction work; at the same time, a shallow feature map of the backbone network is fed into the spatial module, i.e. the spatial transformer, which encodes the spatial information to obtain the spatial conversion feature map out_Spatial-Transformer. The main purpose of both deep feature extraction and spatial encoding is to provide a data basis for the subsequent feature detection and description tasks.
Deep feature extraction:
out_11 = ReLU(conv_11(raw_image))
out_12 = Maxpool(ReLU(conv_12(out_11)))
out_21 = ReLU(conv_21(out_12))
out_22 = Maxpool(ReLU(conv_22(out_21)))
out_31 = ReLU(conv_31(out_22))
out_32 = Maxpool(ReLU(conv_32(out_31)))
out_41 = ReLU(conv_41(out_32))
out_42 = ReLU(conv_42(out_41))
the structure of the backbone network is shown in table 1 below.
TABLE 1
Figure BDA0003681107870000091
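Reading the equations above as a network definition, a minimal PyTorch sketch of the backbone could look as follows. The channel widths, the grayscale input and the choice of out_12 as the shallow feature map handed to the spatial modules are assumptions, since Table 1 is not recoverable here:

```python
import torch.nn as nn
import torch.nn.functional as F

class Backbone(nn.Module):
    """Cascaded convolution layers conv_11 .. conv_42 (channel widths are assumptions)."""
    def __init__(self, c=(64, 64, 128, 128)):
        super().__init__()
        self.conv_11 = nn.Conv2d(1,    c[0], 3, padding=1)   # grayscale input assumed
        self.conv_12 = nn.Conv2d(c[0], c[0], 3, padding=1)
        self.conv_21 = nn.Conv2d(c[0], c[1], 3, padding=1)
        self.conv_22 = nn.Conv2d(c[1], c[1], 3, padding=1)
        self.conv_31 = nn.Conv2d(c[1], c[2], 3, padding=1)
        self.conv_32 = nn.Conv2d(c[2], c[2], 3, padding=1)
        self.conv_41 = nn.Conv2d(c[2], c[3], 3, padding=1)
        self.conv_42 = nn.Conv2d(c[3], c[3], 3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)

    def forward(self, raw_image):
        out_11 = F.relu(self.conv_11(raw_image))
        out_12 = self.pool(F.relu(self.conv_12(out_11)))     # shallow feature map (tap point assumed)
        out_21 = F.relu(self.conv_21(out_12))
        out_22 = self.pool(F.relu(self.conv_22(out_21)))
        out_31 = F.relu(self.conv_31(out_22))
        out_32 = self.pool(F.relu(self.conv_32(out_31)))
        out_41 = F.relu(self.conv_41(out_32))
        out_42 = F.relu(self.conv_42(out_41))                 # deep feature map at 1/8 resolution
        return out_42, out_12
```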
The detector branch consists of a plurality of convolution layers. The deep feature map is received by the first convolution layer of the detector branch and the spatial conversion feature map is received at the output of that first convolution layer, where the output of the first convolution layer and the spatial conversion feature map are fused; the fusion concatenates the output feature map of the first convolution layer in the detector branch and the spatial conversion feature map along the channel dimension. The spatial conversion feature map is the feature map that describes the spatial transformation. The first convolution layer of the detector branch is the first convolution layer that processes the input of the detector branch.
Because the pixel volume of the original image is huge, the detector branch must control the number of detected features to guarantee real-time performance. In the invention, an 8-neighborhood method is adopted for feature detection, and non-maximum suppression is applied within each 8×8 neighborhood to ensure the uniqueness of the feature information. The detector branch compresses the concatenated feature map obtained after fusion to 65 channels by convolution and then normalizes the data across the 65 channels; the detector branch outputs a corner probability map used to characterize the probability that each point in the original image is a corner. Specifically, among the 65 channels, 64 channels characterize the probability of a feature point at each of the 64 pixel positions within an 8×8 neighborhood, and the remaining channel characterizes the probability that no feature is present within that neighborhood. The output results of the detector branch are shown in figs. 6(A)-6(B).
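A hedged sketch of how the 65-channel output could be decoded into a full-resolution corner probability map with non-maximum suppression; the confidence threshold, the NMS radius and the row-major ordering of the 64 in-cell positions are assumptions:

```python
import torch.nn.functional as F

def decode_corner_map(logits_65, conf_thresh=0.015, nms_radius=4):
    """Turn the (B, 65, H/8, W/8) detector output into a (B, H, W) corner probability map."""
    prob = F.softmax(logits_65, dim=1)[:, :64]           # drop the 65th "no feature" channel
    b, _, hc, wc = prob.shape
    prob = prob.reshape(b, 8, 8, hc, wc)                  # 64 channels -> 8x8 positions per cell
    prob = prob.permute(0, 3, 1, 4, 2).reshape(b, hc * 8, wc * 8)
    # simple non-maximum suppression via max pooling
    pooled = F.max_pool2d(prob.unsqueeze(1), kernel_size=2 * nms_radius + 1,
                          stride=1, padding=nms_radius).squeeze(1)
    keep = (prob == pooled) & (prob > conf_thresh)
    return prob * keep.float()
```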
During training, the detector branch uses an information-quantity loss function and divides the original image with an 8×8 neighborhood as the basic unit, obtaining a basic-unit grid. Assume the grid contains H_C × W_C basic units in total, each basic unit is denoted x_hw, and the ground-truth label set of the real scene data set is denoted Y. The loss function of the detector branch is:

L_p(X, Y) = \frac{1}{H_C W_C} \sum_{h=1}^{H_C} \sum_{w=1}^{W_C} l_p(x_{hw}; y_{hw})

l_p(x_{hw}; y) = -\log\left( \frac{\exp(x_{hwy})}{\sum_{k=1}^{65} \exp(x_{hwk})} \right)

wherein H_C is the total number of rows of the basic-unit grid, W_C is the total number of columns of the basic-unit grid, h is the row index of the basic-unit grid, w is the column index of the basic-unit grid, y is the pixel position of the corner within one basic unit, l_p is the negative logarithm of the normalized network prediction at the corner pixel position, x_{hwy} is the network prediction at the corner pixel position within a basic unit, x_{hwk} is the network prediction at an arbitrary pixel position within a basic unit, and k is the channel index.
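In practice this per-cell loss reduces to a standard categorical cross-entropy over the 65 channels. A minimal PyTorch sketch, where the convention of using index 64 for "no corner in this cell" is an assumption:

```python
import torch.nn.functional as F

def detector_loss(x, y):
    """Cross-entropy form of L_p above.
    x: (B, 65, H_C, W_C) raw detector scores.
    y: (B, H_C, W_C) integer labels in [0, 64]; 64 marks a cell with no corner (assumed convention).
    F.cross_entropy applies the softmax normalisation and the negative logarithm of l_p
    internally, then averages over all H_C x W_C basic units."""
    return F.cross_entropy(x, y)
```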
The outputs of the convolution layers of the detector branch are:

out_dect_1 = ReLU(conv_dect_1(out_42))

Cascade superposition (concatenation along the channel dimension):

out_dect_2 = Concat(out_dect_1, out_Spatial-Transformer)

out_dect_3 = ReLU(conv_dect_2(out_dect_2))
out_dect_final = Softmax(out_dect_3)
the detector finger network architecture is shown in table 2 below.
TABLE 2
Figure BDA0003681107870000111
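A minimal sketch of the detector branch corresponding to the equations above, assuming the spatial conversion feature map has already been brought to the resolution of the deep feature map and assuming the channel widths (which Table 2 does not allow us to recover here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DetectorHead(nn.Module):
    """Detector branch sketch: first conv on the deep features, channel-wise concatenation
    with the spatial-transformer feature map, compression to 65 channels, then softmax."""
    def __init__(self, deep_ch=128, st_ch=64, mid_ch=256):
        super().__init__()
        self.conv_dect_1 = nn.Conv2d(deep_ch, mid_ch, 3, padding=1)
        self.conv_dect_2 = nn.Conv2d(mid_ch + st_ch, 65, 1)

    def forward(self, out_42, out_spatial_transformer):
        out_dect_1 = F.relu(self.conv_dect_1(out_42))
        out_dect_2 = torch.cat([out_dect_1, out_spatial_transformer], dim=1)  # cascade superposition
        out_dect_3 = F.relu(self.conv_dect_2(out_dect_2))
        return F.softmax(out_dect_3, dim=1)   # 65-channel corner probability map
```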
The descriptor branch consists of a plurality of convolution layers. The deep feature map is received by the first convolution layer of the descriptor branch and the spatial conversion feature map is received at the output of that first convolution layer, where the output of the first convolution layer and the spatial conversion feature map are fused; the fusion concatenates the output feature map of the first convolution layer in the descriptor branch and the spatial conversion feature map along the channel dimension. The first convolution layer of the descriptor branch is the first convolution layer that processes the input of the descriptor branch.
For real-time reasons, the descriptor branch likewise takes the H_C × W_C basic units as its operating primitives and carries out the feature description work unit by unit. It characterizes feature points with 256-dimensional descriptors and, taking the center point of each 8×8 neighborhood as the position reference, performs pixel-level descriptor interpolation for the detected feature points to further improve the feature description precision.
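One way to realise this pixel-level interpolation is bilinear sampling of the coarse descriptor map at the detected keypoint coordinates; a hedged PyTorch sketch, where the use of grid_sample and the normalisation convention are assumptions:

```python
import torch.nn.functional as F

def sample_descriptors(desc_map, keypoints_xy, image_size):
    """Bilinear interpolation of 256-D descriptors at keypoint positions.
    desc_map: (B, 256, H/8, W/8) descriptor map; keypoints_xy: (B, N, 2) pixel
    coordinates in the full-resolution image; image_size: (H, W)."""
    h, w = image_size
    grid = keypoints_xy.clone().float()
    grid[..., 0] = grid[..., 0] / (w - 1) * 2 - 1          # normalise x to [-1, 1]
    grid[..., 1] = grid[..., 1] / (h - 1) * 2 - 1          # normalise y to [-1, 1]
    grid = grid.unsqueeze(2)                                # (B, N, 1, 2)
    desc = F.grid_sample(desc_map, grid, mode='bilinear', align_corners=True)
    desc = desc.squeeze(-1).transpose(1, 2)                 # (B, N, 256)
    return F.normalize(desc, p=2, dim=-1)                   # unit-length descriptors
```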
As the identification mark of a feature point, the most important attribute of a descriptor is its individual specificity. Clearly distinguishable feature descriptors are essential for feature matching and recognition and are an important guarantee that computer vision tasks such as visual localization, image stitching and scene reconstruction are completed accurately. Therefore, the descriptor branch uses a hinge loss function during training, of the following specific form:
Descriptor map corresponding to the original image: D; homography transformation: H; descriptor map corresponding to the deformed image obtained by applying the homography transformation to the original image: D'.
Descriptor corresponding to the original image: d_{hw}; descriptor corresponding to the deformed image: d'_{h'w'}.
Center pixel coordinate of an 8×8 neighborhood within the original image: p_{hw}.
Center pixel coordinate of an 8×8 neighborhood within the deformed image: p'_{h'w'}.
The correspondence relation is judged as:

s_{hwh'w'} = \begin{cases} 1, & \text{if } \lVert \widehat{H p_{hw}} - p'_{h'w'} \rVert \le 8 \\ 0, & \text{otherwise} \end{cases}

l_d(d_{hw}, d'_{h'w'}, s) = \lambda_d \cdot s \cdot \max(0, m_p - d_{hw}^T d'_{h'w'}) + (1 - s) \cdot \max(0, d_{hw}^T d'_{h'w'} - m_n)

L_d(D, D') = \frac{1}{(H_C W_C)^2} \sum_{h=1}^{H_C} \sum_{w=1}^{W_C} \sum_{h'=1}^{H_C} \sum_{w'=1}^{W_C} l_d(d_{hw}, d'_{h'w'}, s_{hwh'w'})

wherein λ_d, m_p and m_n are empirical thresholds of the loss function. λ_d is designed to balance the loss terms of the positive (s=1) and negative point pairs and ensures that the network parameters descend in the correct direction, while m_p and m_n are set to control the learning process, prevent the overfitting caused by over-learning, and ensure that the network parameters converge to an appropriate range. \widehat{H p_{hw}} is the coordinate of the 8×8 neighborhood center pixel of the original image after homography transformation, s is the correspondence judgment parameter, m_p is the positive margin parameter, d_{hw}^T is the transpose of d_{hw}; h is the row index and w the column index of the basic-unit grid corresponding to the original image, h' is the row index and w' the column index of the basic-unit grid corresponding to the deformed image, and m_n is the negative margin parameter.
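A minimal sketch of this hinge loss for one image pair. The numeric values of lambda_d, m_p and m_n are assumptions (the text only calls them empirical thresholds), and the cell ordering of the descriptor arrays is assumed consistent with the centre-coordinate array:

```python
import torch

def descriptor_loss(desc, desc_warped, centers, H, lambda_d=0.25, m_p=1.0, m_n=0.2):
    """Hinge loss l_d above for one image pair.
    desc, desc_warped: (Hc*Wc, 256) descriptors of the original and warped image cells.
    centers: (Hc*Wc, 2) pixel coordinates of the 8x8 cell centres; H: (3, 3) homography."""
    ones = torch.ones(centers.shape[0], 1, dtype=centers.dtype, device=centers.device)
    warped = (H @ torch.cat([centers, ones], dim=1).t()).t()     # apply homography to cell centres
    warped = warped[:, :2] / warped[:, 2:3]                      # Hp_hw in pixel coordinates
    # s = 1 when the warped centre of cell (h, w) lands within 8 px of cell (h', w')
    dist = torch.cdist(warped, centers)
    s = (dist <= 8).float()
    corr = desc @ desc_warped.t()                                # d_hw^T d'_h'w' for all pairs
    loss = lambda_d * s * torch.clamp(m_p - corr, min=0) + \
           (1 - s) * torch.clamp(corr - m_n, min=0)
    return loss.mean()                                           # mean over (Hc*Wc)^2 pairs
```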
The descriptor branch outputs are:
out_descriptor_1 = ReLU(conv_descriptor_1(out_42))

Cascade superposition (concatenation along the channel dimension):

out_descriptor_2 = Concat(out_descriptor_1, out_Spatial-Transformer)

out_descriptor_3 = ReLU(conv_descriptor_2(out_descriptor_2))
out_descriptor_final = Normalize(out_descriptor_3)
the descriptor branch network structure is shown in table 3.
TABLE 3 Table 3
Figure BDA0003681107870000124
Figure BDA0003681107870000131
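For symmetry with the detector branch, a hedged sketch of the descriptor head; channel widths and the resolution of the spatial conversion feature map are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DescriptorHead(nn.Module):
    """Descriptor branch sketch: mirrors the detector head but compresses the concatenated
    features to 256 channels and L2-normalises along the channel dimension."""
    def __init__(self, deep_ch=128, st_ch=64, mid_ch=256):
        super().__init__()
        self.conv_descriptor_1 = nn.Conv2d(deep_ch, mid_ch, 3, padding=1)
        self.conv_descriptor_2 = nn.Conv2d(mid_ch + st_ch, 256, 1)

    def forward(self, out_42, out_spatial_transformer):
        out_1 = F.relu(self.conv_descriptor_1(out_42))
        out_2 = torch.cat([out_1, out_spatial_transformer], dim=1)   # cascade superposition
        out_3 = F.relu(self.conv_descriptor_2(out_2))
        return F.normalize(out_3, p=2, dim=1)                        # 256-channel descriptor map
```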
In the invention, in the detector branch the concatenated feature map undergoes channel compression followed by feature position scoring, which determines the feature position within each 8×8 neighborhood; in the descriptor branch the concatenated feature map is compressed to 256-dimensional descriptors along the channel dimension, and descriptor interpolation is then performed with the 8×8 neighborhood center pixel coordinate as the position reference, improving the precision and specificity of the descriptors.
Further, as shown in fig. 3, the spatial module comprises a plurality of convolution networks, a grid generator, a sampling network and a sampler. The spatial module receives a shallow feature map from the backbone network as input; a six-degree-of-freedom affine transformation matrix is obtained through the convolution operations of the convolution networks and is fed to the grid generator, which generates a sampling grid; the sampler then samples the pixels of the shallow feature map of the backbone network according to the sampling grid to obtain the spatial conversion feature map.
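A minimal PyTorch sketch of such a spatial transformer, assuming a small localisation network whose layer sizes are not specified in the text and initialising the 6-DOF affine transform to the identity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialModule(nn.Module):
    """Spatial module of fig. 3: convolution networks regress a 6-DOF affine matrix,
    affine_grid acts as the grid generator, and grid_sample acts as the sampler."""
    def __init__(self, in_ch=64):
        super().__init__()
        self.loc = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, 6)
        # start from the identity transform so early training leaves the features unchanged
        self.fc.weight.data.zero_()
        self.fc.bias.data.copy_(torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, shallow_feat):
        theta = self.fc(self.loc(shallow_feat).flatten(1)).view(-1, 2, 3)   # 6-DOF affine matrix
        grid = F.affine_grid(theta, shallow_feat.size(), align_corners=False)  # sampling grid
        return F.grid_sample(shallow_feat, grid, align_corners=False)       # spatial conversion feature map
```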
Further, the training of the processing system is five-stage training, a synthetic data set containing basic geometric patterns is used as a training sample, the data enhancement operation is carried out on the training sample data set in the first stage, and then the detector branches are independently trained by using the training sample data set; a second stage, marking the real scene data set by utilizing the detector branches obtained by the training in the first stage, and obtaining a feature marking data set in the real scene; a third stage, namely completely emptying the weight parameters of the detector branches obtained by training in the first stage, and independently retraining the detector branches by using the feature annotation data set obtained in the second stage; a fourth stage, re-labeling the real scene data set by using the detector branch obtained in the third stage to obtain a secondary labeling data set; and fifthly, emptying the weights of the detector branches and the descriptor branches, and performing joint training on the detector branches and the descriptor branches by using the secondary annotation data set to finally obtain a stable processing system.
The processing system adopts a twin architecture overall and is structurally divided into a front-end backbone network and back-end detector and descriptor branches. The original image is input to the backbone network (Backbone), which performs image convolution processing on the original input image and outputs a deep feature map; this deep feature map is handed to the detector branch and the descriptor branch simultaneously as shared information for their different tasks. Meanwhile, the backbone network is connected to an external spatial processing module: a shallow feature map of the backbone network is separated out and passed to the spatial processing module, which acts as a spatial transformer (Spatial Transformer), processes the shallow feature map to obtain spatial information, and encodes this spatial information into the feature information to produce the spatial conversion feature map. The detector branch and the descriptor branch are twin modules used for feature position detection and feature description respectively. In this twin architecture, the deep feature map output by the backbone network is input to the detector branch and the descriptor branch and concatenated with the spatial conversion feature map; under the guidance of the loss functions, the weight parameters of the processing system are iteratively updated to obtain more accurate feature positions and descriptor information, and the output end of the processing system outputs a 65-channel corner probability map and a 256-channel descriptor map respectively.
In the invention, to prevent the subjective interference caused by manually annotated data, training is completed in a self-supervised manner. The training procedure has five stages in total. In the first stage, a synthetic data set containing basic geometric patterns (polygons, lines, stars, etc.) is generated autonomously by a program, as shown in figs. 4(A)-4(B), and data enhancement operations such as contrast adjustment, noise addition, motion blurring and brightness adjustment are applied to the data set, which is then used to preliminarily train the detector branch of the network alone. In the second stage, the detector branch obtained by the first-stage training is used to label a real scene data set (fig. 5) to obtain a feature annotation data set in real scenes. In the third stage, all weight parameters obtained in the first stage are cleared, and the real scene data set obtained in the second stage is used to retrain the detector branch alone. In the fourth stage, the detector obtained in the third stage is used to relabel the real scene data set; this secondary labeling further refines the quality of the data set and provides the basis for the final stage of training. In the fifth stage, the weights are cleared again and the high-quality data set obtained by the fourth-stage labeling is used to jointly train the whole network structure (detector and descriptor), finally yielding a complete and reliable feature processing network.
Aiming at the problems of unstable feature detection and poor descriptor repeatability that visual simultaneous localization and mapping (VSLAM) suffers under weak texture, the invention provides a twin feature processing network based on deep learning. A multi-layer convolutional neural network is arranged in the backbone to extract deep image features, while a spatial transformer module attached to the main architecture explicitly encodes spatial information and enhances the spatial stability and sensitivity of the feature information; the deep feature map and the spatial conversion feature map of the image are concatenated in the back-end branches to provide rich data for the network output layers. In the feature detector and descriptor branches, the feature map is divided with an 8×8 neighborhood as the basic unit, and the concatenated feature maps are compressed to 65 and 256 channels respectively; the detector branch uses a probability scoring strategy to determine the feature position within each 8×8 neighborhood (the value of the 65th channel represents the probability that no feature is present in the neighborhood), and the descriptor branch marks the feature information with 256-dimensional descriptors. To overcome the subjective errors caused by manually labeled features, the invention constructs the data labels in a self-supervised labeling manner, and the training process is finely divided into five stages to improve data quality and network accuracy, effectively relieving the dilemma of sparse data samples. With the visual feature processing system provided by the invention, multiple computer vision tasks such as visual localization, scene reconstruction and image stitching can run continuously and stably in weak texture environments, alleviating original defects such as missing features and algorithm collapse. The invention enhances functionality while preserving real-time performance to the greatest extent: at the detector level, the 8×8 neighborhood setting effectively controls the number of detected features, and at the descriptor branch, to further improve descriptor specificity without harming real-time performance, descriptor interpolation is computed with the 8×8 neighborhood center pixel coordinate as the position reference. Experimental tests show that, in typical weak texture scenes, the invention has higher practical application value and superior potential compared with existing mainstream feature algorithms.
The invention also provides a visual characteristic processing method based on the neural network in the weak texture environment, which is based on the processing system, and comprises the following steps:
step S1: acquiring an original image;
step S2: inputting the original image into the processing system;
step S3: the processing system performs feature detection and description on the original image to obtain corner points and corresponding descriptors of the original image;
step S4: based on the corner points and descriptors of the original image, image stitching, visual positioning and scene recognition can be completed in a weak texture environment.
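As a usage sketch for step S4, the corner points and 256-dimensional descriptors returned in step S3 could be matched by mutual nearest neighbours before homography or pose estimation; the matching strategy below is an assumption, not a procedure specified by the patent:

```python
import torch

def match_and_estimate(desc_a, desc_b, kpts_a, kpts_b):
    """Mutual-nearest-neighbour matching of 256-D unit descriptors.
    desc_a: (N, 256), desc_b: (M, 256); kpts_a: (N, 2), kpts_b: (M, 2) pixel coordinates.
    The returned matched point pairs can feed homography or pose estimation for
    image stitching, visual localization or scene recognition."""
    sim = desc_a @ desc_b.t()                     # cosine similarity (descriptors are unit length)
    nn_ab = sim.argmax(dim=1)                     # best match in B for each point in A
    nn_ba = sim.argmax(dim=0)                     # best match in A for each point in B
    mutual = nn_ba[nn_ab] == torch.arange(desc_a.shape[0])
    idx_a = torch.nonzero(mutual).squeeze(1)
    idx_b = nn_ab[idx_a]
    return kpts_a[idx_a], kpts_b[idx_b]           # matched point pairs
```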
The above specific embodiments merely describe the design principle of the present invention, and the shapes of the components in the description may be different, and the names are not limited. Therefore, the technical scheme described in the foregoing embodiments can be modified or replaced equivalently by those skilled in the art; such modifications and substitutions do not depart from the spirit and technical scope of the invention, and all of them should be considered to fall within the scope of the invention.

Claims (6)

1. A neural network-based visual feature processing system in a weak texture environment, the processing system comprising:
the system comprises a backbone network, a detector branch and a descriptor branch, wherein the detector branch and the descriptor branch are symmetrical sub-branches of a twin network;
the backbone network is used for receiving an input original image, carrying out convolution processing on the original image and outputting a deep feature map of the original image; the backbone network comprises a plurality of cascaded convolution layers, wherein shallow feature maps obtained after shallow convolution of the backbone network are simultaneously input into a first space module and a second space module; the first space module and the second space module are respectively used for space invariance reduction;
the detector branch comprises a plurality of cascaded convolution layers, the output of the first spatial module is fused with the output of the first convolution layer in the detector branch, and the detector branch outputs a corner probability map which is used for representing the probability that each point in the original image is a corner;
the descriptor branch comprises a plurality of cascaded convolution layers, the output of the second spatial module is fused with the output of the first convolution layer in the descriptor branch, and the descriptor branch outputs a descriptor map which is used for representing the descriptor of each point in the original image;
the detector branches are used for receiving deep feature images of the original images output by the backbone network;
the descriptor branch is used for receiving the deep feature map of the original image output by the backbone network.
2. The system of claim 1, wherein during training the detector branch uses an information-quantity loss function and divides the original image with an 8×8 neighborhood as the basic unit to obtain a basic-unit grid; assuming the grid contains H_C × W_C basic units in total, each basic unit denoted x_hw, and the ground-truth label set of the real scene data set denoted Y, the loss function of the detector branch is:

L_p(X, Y) = \frac{1}{H_C W_C} \sum_{h=1}^{H_C} \sum_{w=1}^{W_C} l_p(x_{hw}; y_{hw})

l_p(x_{hw}; y) = -\log\left( \frac{\exp(x_{hwy})}{\sum_{k=1}^{65} \exp(x_{hwk})} \right)

wherein H_C is the total number of rows of the basic-unit grid, W_C is the total number of columns of the basic-unit grid, h is the row index of the basic-unit grid, w is the column index of the basic-unit grid, y is the pixel position of the corner within one basic unit, l_p is the negative logarithm of the normalized network prediction at the corner pixel position, x_{hwy} is the network prediction at the corner pixel position within a basic unit, x_{hwk} is the network prediction at an arbitrary pixel position within a basic unit, and k is the channel index.
3. The system of claim 2, wherein the descriptor branch uses a hinge loss function during training, of the following form:
Descriptor map corresponding to the original image: D; homography transformation: H; descriptor map corresponding to the deformed image obtained by applying the homography transformation to the original image: D'.
Descriptor corresponding to the original image: d_{hw}; descriptor corresponding to the deformed image: d'_{h'w'}.
Center pixel coordinate of an 8×8 neighborhood within the original image: p_{hw}.
Center pixel coordinate of an 8×8 neighborhood within the deformed image: p'_{h'w'}.
The correspondence relation is judged as:

s_{hwh'w'} = \begin{cases} 1, & \text{if } \lVert \widehat{H p_{hw}} - p'_{h'w'} \rVert \le 8 \\ 0, & \text{otherwise} \end{cases}

l_d(d_{hw}, d'_{h'w'}, s) = \lambda_d \cdot s \cdot \max(0, m_p - d_{hw}^T d'_{h'w'}) + (1 - s) \cdot \max(0, d_{hw}^T d'_{h'w'} - m_n)

L_d(D, D') = \frac{1}{(H_C W_C)^2} \sum_{h=1}^{H_C} \sum_{w=1}^{W_C} \sum_{h'=1}^{H_C} \sum_{w'=1}^{W_C} l_d(d_{hw}, d'_{h'w'}, s_{hwh'w'})

wherein L_d is the descriptor loss function, \widehat{H p_{hw}} is the coordinate of the 8×8 neighborhood center pixel of the original image after homography transformation, λ_d is the weight parameter, s is the correspondence judgment parameter, m_p is the positive margin parameter, d_{hw}^T is the transpose of d_{hw}; h is the row index of the basic-unit grid corresponding to the original image, w is the column index of the basic-unit grid corresponding to the original image, h' is the row index of the basic-unit grid corresponding to the deformed image, w' is the column index of the basic-unit grid corresponding to the deformed image, and m_n is the negative margin parameter.
4. The system of claim 1, wherein the first spatial module and the second spatial module each comprise a plurality of convolutional networks, a grid generator, a sampling network, and a sampler; the space module receives a shallow feature map in a main network as input, a six-degree-of-freedom affine transformation matrix is obtained through convolution operation of the convolution networks, the obtained six-degree-of-freedom affine transformation matrix is input into a grid generator, the grid generator generates grids to obtain a sampling grid, and the sampler samples pixels of the shallow feature map in the main network according to the sampling grid to obtain a space conversion feature map.
5. The system of claim 1, wherein the training of the processing system is a five-stage training, a first stage of performing a data enhancement operation on a training sample dataset, followed by training the detector branches separately using the training sample dataset; a second stage, marking the real scene data set by utilizing the detector branches obtained by the training in the first stage, and obtaining a feature marking data set in the real scene; a third stage, namely completely emptying the weight parameters of the detector branches obtained by training in the first stage, and independently retraining the detector branches by using the feature annotation data set obtained in the second stage; a fourth stage, re-labeling the real scene data set by using the detector branch obtained in the third stage to obtain a secondary labeling data set; and a fifth stage, clearing the weight of the detector branch and the descriptor branch, and utilizing the secondary labeling data set to perform joint training on the detector branch and the descriptor branch.
6. A method for processing visual features based on a neural network in a weak texture environment, based on a processing system according to any one of claims 1-5, characterized in that the method comprises the steps of:
step S1: acquiring an original image;
step S2: inputting the original image into the processing system;
step S3: the processing system performs feature detection and description on the original image to obtain corner points and corresponding descriptors of the original image;
step S4: based on the corner points and descriptors of the original image, image stitching, visual positioning and scene recognition can be completed in a weak texture environment.
CN202210663043.2A 2022-06-07 2022-06-07 Visual characteristic processing system and method based on neural network in weak texture environment Active CN114937153B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210663043.2A CN114937153B (en) 2022-06-07 2022-06-07 Visual characteristic processing system and method based on neural network in weak texture environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210663043.2A CN114937153B (en) 2022-06-07 2022-06-07 Visual characteristic processing system and method based on neural network in weak texture environment

Publications (2)

Publication Number Publication Date
CN114937153A CN114937153A (en) 2022-08-23
CN114937153B true CN114937153B (en) 2023-06-30

Family

ID=82867108

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210663043.2A Active CN114937153B (en) 2022-06-07 2022-06-07 Visual characteristic processing system and method based on neural network in weak texture environment

Country Status (1)

Country Link
CN (1) CN114937153B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117710467A (en) * 2024-02-06 2024-03-15 天津云圣智能科技有限责任公司 Unmanned plane positioning method, unmanned plane positioning equipment and aircraft

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861988A (en) * 2021-03-04 2021-05-28 西南科技大学 Feature matching method based on attention-seeking neural network
CN113066129A (en) * 2021-04-12 2021-07-02 北京理工大学 Visual positioning and mapping system based on target detection in dynamic environment
CN113610905A (en) * 2021-08-02 2021-11-05 北京航空航天大学 Deep learning remote sensing image registration method based on subimage matching and application

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111768432B (en) * 2020-06-30 2022-06-10 中国科学院自动化研究所 Moving target segmentation method and system based on twin deep neural network
CN113988269A (en) * 2021-11-05 2022-01-28 南通大学 Loop detection and optimization method based on improved twin network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861988A (en) * 2021-03-04 2021-05-28 西南科技大学 Feature matching method based on attention-seeking neural network
CN113066129A (en) * 2021-04-12 2021-07-02 北京理工大学 Visual positioning and mapping system based on target detection in dynamic environment
CN113610905A (en) * 2021-08-02 2021-11-05 北京航空航天大学 Deep learning remote sensing image registration method based on subimage matching and application

Also Published As

Publication number Publication date
CN114937153A (en) 2022-08-23

Similar Documents

Publication Publication Date Title
CN114255238A (en) Three-dimensional point cloud scene segmentation method and system fusing image features
CN107844743A (en) A kind of image multi-subtitle automatic generation method based on multiple dimensioned layering residual error network
CN110796018A (en) Hand motion recognition method based on depth image and color image
CN113177560A (en) Universal lightweight deep learning vehicle detection method
CN113449691A (en) Human shape recognition system and method based on non-local attention mechanism
CN112819080B (en) High-precision universal three-dimensional point cloud identification method
CN114937153B (en) Visual characteristic processing system and method based on neural network in weak texture environment
CN114863539A (en) Portrait key point detection method and system based on feature fusion
CN111488856B (en) Multimodal 2D and 3D facial expression recognition method based on orthogonal guide learning
CN116258990A (en) Cross-modal affinity-based small sample reference video target segmentation method
Wang et al. Accurate real-time ship target detection using Yolov4
Sun et al. IRDCLNet: Instance segmentation of ship images based on interference reduction and dynamic contour learning in foggy scenes
CN111368733A (en) Three-dimensional hand posture estimation method based on label distribution learning, storage medium and terminal
CN113901928A (en) Target detection method based on dynamic super-resolution, and power transmission line component detection method and system
CN113489958A (en) Dynamic gesture recognition method and system based on video coding data multi-feature fusion
CN113505808A (en) Detection and identification algorithm for power distribution facility switch based on deep learning
CN111860361A (en) Green channel cargo scanning image entrainment automatic identifier and identification method
CN117011380A (en) 6D pose estimation method of target object
CN114494284B (en) Scene analysis model and method based on explicit supervision area relation
CN113486718B (en) Fingertip detection method based on deep multitask learning
CN115393735A (en) Remote sensing image building extraction method based on improved U-Net
Si et al. Image semantic segmentation based on improved DeepLab V3 model
Fu et al. Complementarity-aware Local-global Feature Fusion Network for Building Extraction in Remote Sensing Images
CN113538474A (en) 3D point cloud segmentation target detection system based on edge feature fusion
CN117274723B (en) Target identification method, system, medium and equipment for power transmission inspection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant