CN114937153B - Visual characteristic processing system and method based on neural network in weak texture environment - Google Patents

Visual characteristic processing system and method based on neural network in weak texture environment Download PDF

Info

Publication number
CN114937153B
Authority
CN
China
Prior art keywords
original image
branch
detector
descriptor
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210663043.2A
Other languages
Chinese (zh)
Other versions
CN114937153A (en
Inventor
方浩
胡家瑞
王奥博
陈杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202210663043.2A priority Critical patent/CN114937153B/en
Publication of CN114937153A publication Critical patent/CN114937153A/en
Application granted granted Critical
Publication of CN114937153B publication Critical patent/CN114937153B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Abstract

The invention discloses a neural-network-based visual feature processing system and method for weak texture environments. The processing system comprises a backbone network, a detector branch and a descriptor branch, wherein the detector branch and the descriptor branch are symmetrical sub-branches of a twin (Siamese) network. The backbone network performs convolution processing on an original image and outputs a deep feature map of the original image. The output of a first spatial module is fused with the output of the first convolution layer in the detector branch, and the detector branch outputs a corner probability map that represents the probability that each point in the original image is a corner. The output of a second spatial module is fused with the output of the first convolution layer in the descriptor branch, and the descriptor branch outputs a descriptor map that represents the descriptor of each point in the original image.

Description

Visual characteristic processing system and method based on neural network in weak texture environment
Technical Field
The invention relates to the field of computer vision, in particular to a visual characteristic processing system and method based on a neural network in a weak texture environment.
Background
In recent years, artificial intelligence has developed rapidly and a global pattern of automation is gradually taking shape. Computer vision, as one of the core perception technologies, has created considerable social, economic and academic value, and the depth and breadth of its social applications keep growing: industries such as security, medicine, agriculture and forestry, and manufacturing are gradually entering an era of visual intelligence, making computer vision an indispensable enabling technology for intelligent innovation. Feature information is crucial in the process of visual enabling and is the key cue by which a computing system understands and recognizes images. Researchers have proposed rich feature design schemes based on graphics; feature information that is distinctive and repeatable provides good operating primitives for visual tasks such as image retrieval, image stitching, VSLAM and three-dimensional reconstruction. Among these, VSLAM endows agents such as unmanned aerial vehicles and unmanned ground vehicles with self-localization and environment perception capabilities and is an important technical driver of intelligent unmanned systems. However, traditional geometry-based image features depend excessively on image quality and are inherently sensitive to changes in the imaging environment; when facing the common weak-texture scenes shown in fig. 1, feature quality degrades, the feature algorithm fails, and tasks such as VSLAM collapse. Existing feature processing techniques still have obvious deficiencies in resisting environmental interference, coping with device noise and adapting to motion changes, and technological innovation and product incubation urgently require robust and accurate feature extraction and description algorithms. For visual localization and mapping tasks in weak texture environments, the existing solutions are as follows:
scheme 1: yiK M, etc., LIFT Learned Invariant Feature Transform [ J ]. The proposal utilizes a motion structure recovery method to construct a supervision signal to make up for the data deficiency, and realizes the interactive connection and End-to-End synchronous learning of three subtask networks (a detector, a direction estimator and a descriptor) under a unified framework. However, the sub-networks in the LIFT model fail to form a calculation sharing relation, so that the LIFT features are difficult to meet the real-time application requirements.
Scheme 2: detone D et al, superPoint: self-Supervised Interest Point Detection and Description [ J ]. According to the scheme, a twin neural network design is adopted, calculation sharing between a detection network and a description network is basically realized, the real-time performance of the method is excellent, a self-labeling method is adopted in the SuperPoint scheme to obtain training samples, a labeling device and homography transformation are utilized to label an original image to finish false true values, and the SuperPoint network can output denser and more repeatable image features due to the homography transformation mechanism. However, this work uses implicit methods to model spatial characteristics, which are not ideal in effect.
Scheme 3: dusmanu M et al, A trainable cnn for joint description and detection of local features [ C ]. The scheme provides a concept of synchronous detection and description, breaks through the traditional mode of 'first detection and then description' in the time dimension, and the network output simultaneously comprises characteristic position score and descriptor information, so that the D2-Net truly realizes complete fusion of the detector and the descriptor, and achieves good effects in the network efficiency level. However, D2-Net performs poorly in terms of feature accuracy.
Disclosure of Invention
In view of the above, the present invention provides a visual feature processing system and method based on a neural network for weak texture environments, which can solve the technical problem of reducing the interference of weak texture scenes with the feature extraction and description process.
To solve the above technical problem, the present invention is implemented as follows.
A neural network-based visual feature processing system in a weak texture environment, comprising:
the system comprises a backbone network, a detector branch and a descriptor branch, wherein the detector branch and the descriptor branch are symmetrical sub-branches of a twin network;
the backbone network is used for receiving an input original image, carrying out convolution processing on the original image and outputting a deep feature map of the original image; the backbone network comprises a plurality of cascaded convolution layers, wherein shallow feature maps obtained after shallow convolution of the backbone network are simultaneously input into a first space module and a second space module; the first space module and the second space module are respectively used for space invariance reduction;
the detector branches comprise a plurality of cascade convolution layers, the output of the first space module is fused with the output of the first convolution layer in the detector branches, and the detector branches output corner probability diagrams which are used for representing the probability that each point in the original image is a corner;
the descriptor branch comprises a plurality of cascaded convolution layers, the output of the second spatial module is fused with the output of the first convolution layer in the descriptor branch, and the descriptor branch outputs a description subgraph which is used for representing the description sub-morphology of each point in the original image.
Preferably, the detector branch is configured to receive the deep feature map of the original image output by the backbone network; the detector branch includes a plurality of cascaded convolution layers, the output of the first spatial module is fused with the output of the first convolution layer in the detector branch, and the detector branch outputs a corner probability map used to characterize the probability that each point in the original image is a corner;
the descriptor branch is used for receiving the deep feature map of the original image output by the backbone network; the descriptor branch comprises a plurality of cascaded convolution layers, the output of the second spatial module is fused with the output of the first convolution layer in the descriptor branch, and the descriptor branch outputs a descriptor map used to characterize the descriptor of each point in the original image.
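As a minimal sketch of this twin data flow, assuming PyTorch and hypothetical sub-module names (a backbone that returns both the deep feature map and the shallow feature map fed to the spatial modules, plus separately defined spatial modules and heads; the patent publishes no reference code), the forward pass could be organised as follows:

```python
import torch.nn as nn

class TwinFeatureNet(nn.Module):
    """Illustrative composition of the processing system; sub-modules are assumptions."""
    def __init__(self, backbone, spatial_module_det, spatial_module_desc,
                 detector_head, descriptor_head):
        super().__init__()
        self.backbone = backbone              # cascaded convolution layers
        self.st_det = spatial_module_det      # first spatial module
        self.st_desc = spatial_module_desc    # second spatial module
        self.detector = detector_head         # produces the 65-channel corner probability map
        self.descriptor = descriptor_head     # produces the 256-channel descriptor map

    def forward(self, image):
        deep, shallow = self.backbone(image)            # deep feature map + shallow feature map
        st_det = self.st_det(shallow)                   # spatial conversion feature maps
        st_desc = self.st_desc(shallow)
        corner_prob = self.detector(deep, st_det)       # fusion happens inside the detector branch
        descriptor_map = self.descriptor(deep, st_desc) # fusion happens inside the descriptor branch
        return corner_prob, descriptor_map
```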
Preferably, during training the detector branch uses an information-quantity loss function and divides the original image with an 8×8 neighborhood as the basic unit, obtaining a basic-unit grid. Assume the grid contains H_C × W_C basic units in total, each basic unit is denoted x_hw, and the ground-truth label set of the real scene data set is denoted Y. The loss function of the detector branch is:

L_p(X, Y) = \frac{1}{H_C W_C} \sum_{h=1}^{H_C} \sum_{w=1}^{W_C} l_p(x_{hw}; y_{hw})

l_p(x_{hw}; y) = -\log\left( \frac{\exp(x_{hwy})}{\sum_{k=1}^{65} \exp(x_{hwk})} \right)

wherein H_C is the total number of rows of the basic-unit grid, W_C is the total number of columns of the basic-unit grid, h is the row index of the basic-unit grid, w is the column index of the basic-unit grid, y is the pixel position of the corner within one basic unit, l_p is the negative logarithm of the normalized network prediction at the corner pixel position, x_{hwy} is the network prediction at the corner pixel position within a basic unit, x_{hwk} is the network prediction at an arbitrary pixel position within a basic unit, and k is the channel index.
Preferably, the descriptor branch uses a hinge loss function during training, of the following specific form:
Descriptor map corresponding to the original image: D; homography transformation: H; descriptor map corresponding to the deformed image obtained by applying the homography transformation to the original image: D'.
Descriptor corresponding to the original image: d_{hw}; descriptor corresponding to the deformed image: d'_{h'w'}.
Center pixel coordinate of an 8×8 neighborhood within the original image: p_{hw}.
Center pixel coordinate of an 8×8 neighborhood within the deformed image: p'_{h'w'}.
The correspondence relation is judged as:

s_{hwh'w'} = \begin{cases} 1, & \text{if } \lVert \widehat{H p_{hw}} - p'_{h'w'} \rVert \le 8 \\ 0, & \text{otherwise} \end{cases}

l_d(d_{hw}, d'_{h'w'}, s) = \lambda_d \cdot s \cdot \max(0, m_p - d_{hw}^T d'_{h'w'}) + (1 - s) \cdot \max(0, d_{hw}^T d'_{h'w'} - m_n)

L_d(D, D') = \frac{1}{(H_C W_C)^2} \sum_{h=1}^{H_C} \sum_{w=1}^{W_C} \sum_{h'=1}^{H_C} \sum_{w'=1}^{W_C} l_d(d_{hw}, d'_{h'w'}, s_{hwh'w'})

wherein L_d is the descriptor loss function, \widehat{H p_{hw}} is the coordinate of the 8×8 neighborhood center pixel of the original image after homography transformation, λ_d is the weight parameter, s is the correspondence judgment parameter, m_p is the positive margin parameter, d_{hw}^T is the transpose of d_{hw}; h is the row index of the basic-unit grid corresponding to the original image, w is the column index of the basic-unit grid corresponding to the original image, h' is the row index of the basic-unit grid corresponding to the deformed image, w' is the column index of the basic-unit grid corresponding to the deformed image, and m_n is the negative margin parameter.
Further, the first space module and the second space module each comprise a plurality of convolution networks, a grid generator, a sampling network and a sampler; the space module receives a shallow feature map in a main network as input, a six-degree-of-freedom affine transformation matrix is obtained through convolution operation of the convolution networks, the obtained six-degree-of-freedom affine transformation matrix is input into a grid generator, the grid generator generates grids to obtain a sampling grid, and the sampler samples pixels of the shallow feature map in the main network according to the sampling grid to obtain a space conversion feature map.
Further, the training of the processing system is five-stage training, wherein in the first stage, data enhancement operation is performed on a training sample data set, and then the detector branches are independently trained by using the training sample data set; a second stage, marking the real scene data set by utilizing the detector branches obtained by the training in the first stage, and obtaining a feature marking data set in the real scene; a third stage, namely completely emptying the weight parameters of the detector branches obtained by training in the first stage, and independently retraining the detector branches by using the feature annotation data set obtained in the second stage; a fourth stage, re-labeling the real scene data set by using the detector branch obtained in the third stage to obtain a secondary labeling data set; and a fifth stage, clearing the weight of the detector branch and the descriptor branch, and utilizing the secondary labeling data set to perform joint training on the detector branch and the descriptor branch.
The invention provides a visual characteristic processing method based on a neural network in a weak texture environment, which comprises the following steps:
step S1: acquiring an original image;
step S2: inputting the original image into the processing system;
step S3: the processing system performs feature detection and description on the original image to obtain corner points and corresponding descriptors of the original image;
step S4: based on the corner points and descriptors of the original image, image stitching, visual positioning and scene recognition can be completed in a weak texture environment.
The beneficial effects are that:
the invention fully plays the advantages of the deep learning method, guides the network to pay attention to the scene area with rich texture information in a data driving mode, and enhances the overall spatial stability and sensitivity of the network by adding a spatial processing module in a targeted manner. In the invention, the space module is connected into the twin part in a layer jump connection mode, and the space adaptation capacity of the network model is expanded on the premise of guaranteeing the authenticity of the deep features of the image.
The method has the following technical effects:
(1) The invention provides a visual feature processing system based on a neural network to reduce the adverse effect of a weak texture scene on visual feature processing work, and the data driving method breaks through the geometric rule constraint suffered by the traditional feature algorithm, so that the real-time performance is ensured, and the image information utilization rate is further improved.
(2) According to the invention, a space converter module is introduced, and the converted space conversion feature map and the original feature map are subjected to cascade superposition, so that the processing method explicitly models the space characteristics of the image, and compared with the implicit modeling method in the prior work, the processing method has more excellent performance.
(3) The invention adopts the self-supervision labeling method to complete training, solves the problems of subjective errors and sample deletion of manual labeling, further improves the data utilization rate, fully develops the potential of a network structure, furthest reduces the damage of scene limitation to a feature network, and has remarkable significance for enhancing the practical application value of the deep learning technology in the problems of feature extraction and description.
(4) According to the invention, through the twin neural network architecture and the self-supervision labeling training strategy, strong constraint brought by geometric rules in the feature extraction process is eliminated, so that the network has excellent robustness, flexibility and scene adaptability, and the external environment interference and algorithm complexity are reduced.
(5) The processing system adopts a twin architecture and a feature processing algorithm that establishes a standard framework integrating feature extraction and description; the spatial module with the structure shown in fig. 3 models spatial characteristics explicitly, ensuring the specificity and spatial quality of the extracted scene features.
Drawings
FIG. 1 is a schematic diagram of a weak texture scene;
FIG. 2 is a schematic diagram of a visual feature processing system architecture based on a neural network in a weak texture environment provided by the present invention;
FIG. 3 is a schematic diagram of a space module architecture according to the present invention;
FIGS. 4 (A) -4 (B) are schematic diagrams of synthetic datasets provided by the present invention;
FIG. 5 is a schematic representation of a real scene data set used in the present invention;
fig. 6 (a) -6 (B) are diagrams of output results of the detector provided by the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and examples.
As shown in fig. 2-3, the present invention proposes a visual feature processing system based on a neural network in a weak texture environment, where the processing system includes:
the system comprises a backbone network, a detector branch and a descriptor branch, wherein the detector branch and the descriptor branch are symmetrical sub-branches of a twin network;
the backbone network is used for receiving an input original image, carrying out convolution processing on the original image and outputting a deep feature map of the original image; the backbone network comprises a plurality of cascaded convolution layers, wherein shallow feature maps obtained after shallow convolution of the backbone network are simultaneously input into a first space module and a second space module; the first space module and the second space module are respectively used for space invariance reduction;
the detector branches comprise a plurality of cascade convolution layers, the output of the first space module is fused with the output of the first convolution layer in the detector branches, and the detector branches output corner probability diagrams which are used for representing the probability that each point in the original image is a corner;
the descriptor branch comprises a plurality of cascaded convolution layers, the output of the second spatial module is fused with the output of the first convolution layer in the descriptor branch, and the descriptor branch outputs a description subgraph which is used for representing the description sub-morphology of each point in the original image.
The shallow convolution means that the image is only processed by partial convolution layers and does not reach the deep degree.
Further, the detector branch is configured to receive the deep feature map of the original image output by the backbone network; the detector branch includes a plurality of cascaded convolution layers, the output of the first spatial module is fused with the output of the first convolution layer in the detector branch, and the detector branch outputs a corner probability map used to characterize the probability that each point in the original image is a corner;
the descriptor branch is used for receiving the deep feature map of the original image output by the backbone network; the descriptor branch comprises a plurality of cascaded convolution layers, the output of the second spatial module is fused with the output of the first convolution layer in the descriptor branch, and the descriptor branch outputs a descriptor map used to characterize the descriptor of each point in the original image.
Image features are the basic operating units of many computer vision tasks and the key information by which a computer interprets image content. By exploiting the strong ability of convolutional neural networks to extract deep image features over a wide receptive field, a deep-learning-based feature processing algorithm can escape geometric rule constraints and weaken external environmental interference. After a scene image is read in, the original image is convolutionally encoded by the multiple convolution layers of the backbone network to complete the deep feature extraction work; at the same time, a shallow feature map of the backbone network is fed into the spatial module, i.e. the spatial transformer, which encodes the spatial information to obtain the spatial conversion feature map out_Spatial-Transformer. The main purpose of both deep feature extraction and spatial encoding is to provide a data basis for the subsequent feature detection and description tasks.
Deep feature extraction:
out_11 = ReLU(conv_11(raw_image))
out_12 = Maxpool(ReLU(conv_12(out_11)))
out_21 = ReLU(conv_21(out_12))
out_22 = Maxpool(ReLU(conv_22(out_21)))
out_31 = ReLU(conv_31(out_22))
out_32 = Maxpool(ReLU(conv_32(out_31)))
out_41 = ReLU(conv_41(out_32))
out_42 = ReLU(conv_42(out_41))
the structure of the backbone network is shown in table 1 below.
TABLE 1
Figure BDA0003681107870000091
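Reading the equations above as a network definition, a minimal PyTorch sketch of the backbone could look as follows. The channel widths, the grayscale input and the choice of out_12 as the shallow feature map handed to the spatial modules are assumptions, since Table 1 is not recoverable here:

```python
import torch.nn as nn
import torch.nn.functional as F

class Backbone(nn.Module):
    """Cascaded convolution layers conv_11 .. conv_42 (channel widths are assumptions)."""
    def __init__(self, c=(64, 64, 128, 128)):
        super().__init__()
        self.conv_11 = nn.Conv2d(1,    c[0], 3, padding=1)   # grayscale input assumed
        self.conv_12 = nn.Conv2d(c[0], c[0], 3, padding=1)
        self.conv_21 = nn.Conv2d(c[0], c[1], 3, padding=1)
        self.conv_22 = nn.Conv2d(c[1], c[1], 3, padding=1)
        self.conv_31 = nn.Conv2d(c[1], c[2], 3, padding=1)
        self.conv_32 = nn.Conv2d(c[2], c[2], 3, padding=1)
        self.conv_41 = nn.Conv2d(c[2], c[3], 3, padding=1)
        self.conv_42 = nn.Conv2d(c[3], c[3], 3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)

    def forward(self, raw_image):
        out_11 = F.relu(self.conv_11(raw_image))
        out_12 = self.pool(F.relu(self.conv_12(out_11)))     # shallow feature map (tap point assumed)
        out_21 = F.relu(self.conv_21(out_12))
        out_22 = self.pool(F.relu(self.conv_22(out_21)))
        out_31 = F.relu(self.conv_31(out_22))
        out_32 = self.pool(F.relu(self.conv_32(out_31)))
        out_41 = F.relu(self.conv_41(out_32))
        out_42 = F.relu(self.conv_42(out_41))                 # deep feature map at 1/8 resolution
        return out_42, out_12
```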
The detector branch consists of a plurality of convolution layers. The deep feature map is received by the first convolution layer of the detector branch and the spatial conversion feature map is received at the output of that first convolution layer, where the output of the first convolution layer and the spatial conversion feature map are fused; the fusion concatenates the output feature map of the first convolution layer in the detector branch and the spatial conversion feature map along the channel dimension. The spatial conversion feature map is the feature map that describes the spatial transformation. The first convolution layer of the detector branch is the first convolution layer that processes the input of the detector branch.
Because the pixel volume of the original image is huge, the detector branch must control the number of detected features to guarantee real-time performance. In the invention, an 8-neighborhood method is adopted for feature detection, and non-maximum suppression is applied within each 8×8 neighborhood to ensure the uniqueness of the feature information. The detector branch compresses the concatenated feature map obtained after fusion to 65 channels by convolution and then normalizes the data across the 65 channels; the detector branch outputs a corner probability map used to characterize the probability that each point in the original image is a corner. Specifically, among the 65 channels, 64 channels characterize the probability of a feature point at each of the 64 pixel positions within an 8×8 neighborhood, and the remaining channel characterizes the probability that no feature is present within that neighborhood. The output results of the detector branch are shown in figs. 6(A)-6(B).
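A hedged sketch of how the 65-channel output could be decoded into a full-resolution corner probability map with non-maximum suppression; the confidence threshold, the NMS radius and the row-major ordering of the 64 in-cell positions are assumptions:

```python
import torch.nn.functional as F

def decode_corner_map(logits_65, conf_thresh=0.015, nms_radius=4):
    """Turn the (B, 65, H/8, W/8) detector output into a (B, H, W) corner probability map."""
    prob = F.softmax(logits_65, dim=1)[:, :64]           # drop the 65th "no feature" channel
    b, _, hc, wc = prob.shape
    prob = prob.reshape(b, 8, 8, hc, wc)                  # 64 channels -> 8x8 positions per cell
    prob = prob.permute(0, 3, 1, 4, 2).reshape(b, hc * 8, wc * 8)
    # simple non-maximum suppression via max pooling
    pooled = F.max_pool2d(prob.unsqueeze(1), kernel_size=2 * nms_radius + 1,
                          stride=1, padding=nms_radius).squeeze(1)
    keep = (prob == pooled) & (prob > conf_thresh)
    return prob * keep.float()
```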
During training, the detector branch uses an information-quantity loss function and divides the original image with an 8×8 neighborhood as the basic unit, obtaining a basic-unit grid. Assume the grid contains H_C × W_C basic units in total, each basic unit is denoted x_hw, and the ground-truth label set of the real scene data set is denoted Y. The loss function of the detector branch is:

L_p(X, Y) = \frac{1}{H_C W_C} \sum_{h=1}^{H_C} \sum_{w=1}^{W_C} l_p(x_{hw}; y_{hw})

l_p(x_{hw}; y) = -\log\left( \frac{\exp(x_{hwy})}{\sum_{k=1}^{65} \exp(x_{hwk})} \right)

wherein H_C is the total number of rows of the basic-unit grid, W_C is the total number of columns of the basic-unit grid, h is the row index of the basic-unit grid, w is the column index of the basic-unit grid, y is the pixel position of the corner within one basic unit, l_p is the negative logarithm of the normalized network prediction at the corner pixel position, x_{hwy} is the network prediction at the corner pixel position within a basic unit, x_{hwk} is the network prediction at an arbitrary pixel position within a basic unit, and k is the channel index.
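In practice this per-cell loss reduces to a standard categorical cross-entropy over the 65 channels. A minimal PyTorch sketch, where the convention of using index 64 for "no corner in this cell" is an assumption:

```python
import torch.nn.functional as F

def detector_loss(x, y):
    """Cross-entropy form of L_p above.
    x: (B, 65, H_C, W_C) raw detector scores.
    y: (B, H_C, W_C) integer labels in [0, 64]; 64 marks a cell with no corner (assumed convention).
    F.cross_entropy applies the softmax normalisation and the negative logarithm of l_p
    internally, then averages over all H_C x W_C basic units."""
    return F.cross_entropy(x, y)
```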
The outputs of the convolution layers of the detector branch are:

out_dect_1 = ReLU(conv_dect_1(out_42))

Cascade superposition (concatenation along the channel dimension):

out_dect_2 = Concat(out_dect_1, out_Spatial-Transformer)

out_dect_3 = ReLU(conv_dect_2(out_dect_2))
out_dect_final = Softmax(out_dect_3)
the detector finger network architecture is shown in table 2 below.
TABLE 2
Figure BDA0003681107870000111
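A minimal sketch of the detector branch corresponding to the equations above, assuming the spatial conversion feature map has already been brought to the resolution of the deep feature map and assuming the channel widths (which Table 2 does not allow us to recover here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DetectorHead(nn.Module):
    """Detector branch sketch: first conv on the deep features, channel-wise concatenation
    with the spatial-transformer feature map, compression to 65 channels, then softmax."""
    def __init__(self, deep_ch=128, st_ch=64, mid_ch=256):
        super().__init__()
        self.conv_dect_1 = nn.Conv2d(deep_ch, mid_ch, 3, padding=1)
        self.conv_dect_2 = nn.Conv2d(mid_ch + st_ch, 65, 1)

    def forward(self, out_42, out_spatial_transformer):
        out_dect_1 = F.relu(self.conv_dect_1(out_42))
        out_dect_2 = torch.cat([out_dect_1, out_spatial_transformer], dim=1)  # cascade superposition
        out_dect_3 = F.relu(self.conv_dect_2(out_dect_2))
        return F.softmax(out_dect_3, dim=1)   # 65-channel corner probability map
```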
The descriptor branch consists of a plurality of convolution layers. The deep feature map is received by the first convolution layer of the descriptor branch and the spatial conversion feature map is received at the output of that first convolution layer, where the output of the first convolution layer and the spatial conversion feature map are fused; the fusion concatenates the output feature map of the first convolution layer in the descriptor branch and the spatial conversion feature map along the channel dimension. The first convolution layer of the descriptor branch is the first convolution layer that processes the input of the descriptor branch.
For real-time reasons, the descriptor branch likewise takes the H_C × W_C basic units as its operating primitives and carries out the feature description work unit by unit. It characterizes feature points with 256-dimensional descriptors and, taking the center point of each 8×8 neighborhood as the position reference, performs pixel-level descriptor interpolation for the detected feature points to further improve the feature description precision.
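One way to realise this pixel-level interpolation is bilinear sampling of the coarse descriptor map at the detected keypoint coordinates; a hedged PyTorch sketch, where the use of grid_sample and the normalisation convention are assumptions:

```python
import torch.nn.functional as F

def sample_descriptors(desc_map, keypoints_xy, image_size):
    """Bilinear interpolation of 256-D descriptors at keypoint positions.
    desc_map: (B, 256, H/8, W/8) descriptor map; keypoints_xy: (B, N, 2) pixel
    coordinates in the full-resolution image; image_size: (H, W)."""
    h, w = image_size
    grid = keypoints_xy.clone().float()
    grid[..., 0] = grid[..., 0] / (w - 1) * 2 - 1          # normalise x to [-1, 1]
    grid[..., 1] = grid[..., 1] / (h - 1) * 2 - 1          # normalise y to [-1, 1]
    grid = grid.unsqueeze(2)                                # (B, N, 1, 2)
    desc = F.grid_sample(desc_map, grid, mode='bilinear', align_corners=True)
    desc = desc.squeeze(-1).transpose(1, 2)                 # (B, N, 256)
    return F.normalize(desc, p=2, dim=-1)                   # unit-length descriptors
```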
As the identification mark of a feature point, the most important attribute of a descriptor is its individual specificity. Clearly distinguishable feature descriptors are essential for feature matching and recognition and are an important guarantee that computer vision tasks such as visual localization, image stitching and scene reconstruction are completed accurately. Therefore, the descriptor branch uses a hinge loss function during training, of the following specific form:
Descriptor map corresponding to the original image: D; homography transformation: H; descriptor map corresponding to the deformed image obtained by applying the homography transformation to the original image: D'.
Descriptor corresponding to the original image: d_{hw}; descriptor corresponding to the deformed image: d'_{h'w'}.
Center pixel coordinate of an 8×8 neighborhood within the original image: p_{hw}.
Center pixel coordinate of an 8×8 neighborhood within the deformed image: p'_{h'w'}.
The correspondence relation is judged as:

s_{hwh'w'} = \begin{cases} 1, & \text{if } \lVert \widehat{H p_{hw}} - p'_{h'w'} \rVert \le 8 \\ 0, & \text{otherwise} \end{cases}

l_d(d_{hw}, d'_{h'w'}, s) = \lambda_d \cdot s \cdot \max(0, m_p - d_{hw}^T d'_{h'w'}) + (1 - s) \cdot \max(0, d_{hw}^T d'_{h'w'} - m_n)

L_d(D, D') = \frac{1}{(H_C W_C)^2} \sum_{h=1}^{H_C} \sum_{w=1}^{W_C} \sum_{h'=1}^{H_C} \sum_{w'=1}^{W_C} l_d(d_{hw}, d'_{h'w'}, s_{hwh'w'})

wherein λ_d, m_p and m_n are empirical thresholds of the loss function. λ_d is designed to balance the loss terms of the positive (s=1) and negative point pairs and ensures that the network parameters descend in the correct direction, while m_p and m_n are set to control the learning process, prevent the overfitting caused by over-learning, and ensure that the network parameters converge to an appropriate range. \widehat{H p_{hw}} is the coordinate of the 8×8 neighborhood center pixel of the original image after homography transformation, s is the correspondence judgment parameter, m_p is the positive margin parameter, d_{hw}^T is the transpose of d_{hw}; h is the row index and w the column index of the basic-unit grid corresponding to the original image, h' is the row index and w' the column index of the basic-unit grid corresponding to the deformed image, and m_n is the negative margin parameter.
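A minimal sketch of this hinge loss for one image pair. The numeric values of lambda_d, m_p and m_n are assumptions (the text only calls them empirical thresholds), and the cell ordering of the descriptor arrays is assumed consistent with the centre-coordinate array:

```python
import torch

def descriptor_loss(desc, desc_warped, centers, H, lambda_d=0.25, m_p=1.0, m_n=0.2):
    """Hinge loss l_d above for one image pair.
    desc, desc_warped: (Hc*Wc, 256) descriptors of the original and warped image cells.
    centers: (Hc*Wc, 2) pixel coordinates of the 8x8 cell centres; H: (3, 3) homography."""
    ones = torch.ones(centers.shape[0], 1, dtype=centers.dtype, device=centers.device)
    warped = (H @ torch.cat([centers, ones], dim=1).t()).t()     # apply homography to cell centres
    warped = warped[:, :2] / warped[:, 2:3]                      # Hp_hw in pixel coordinates
    # s = 1 when the warped centre of cell (h, w) lands within 8 px of cell (h', w')
    dist = torch.cdist(warped, centers)
    s = (dist <= 8).float()
    corr = desc @ desc_warped.t()                                # d_hw^T d'_h'w' for all pairs
    loss = lambda_d * s * torch.clamp(m_p - corr, min=0) + \
           (1 - s) * torch.clamp(corr - m_n, min=0)
    return loss.mean()                                           # mean over (Hc*Wc)^2 pairs
```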
The descriptor branch outputs are:
out_descriptor_1 = ReLU(conv_descriptor_1(out_42))

Cascade superposition (concatenation along the channel dimension):

out_descriptor_2 = Concat(out_descriptor_1, out_Spatial-Transformer)

out_descriptor_3 = ReLU(conv_descriptor_2(out_descriptor_2))
out_descriptor_final = Normalize(out_descriptor_3)
the descriptor branch network structure is shown in table 3.
TABLE 3 Table 3
Figure BDA0003681107870000124
Figure BDA0003681107870000131
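For symmetry with the detector branch, a hedged sketch of the descriptor head; channel widths and the resolution of the spatial conversion feature map are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DescriptorHead(nn.Module):
    """Descriptor branch sketch: mirrors the detector head but compresses the concatenated
    features to 256 channels and L2-normalises along the channel dimension."""
    def __init__(self, deep_ch=128, st_ch=64, mid_ch=256):
        super().__init__()
        self.conv_descriptor_1 = nn.Conv2d(deep_ch, mid_ch, 3, padding=1)
        self.conv_descriptor_2 = nn.Conv2d(mid_ch + st_ch, 256, 1)

    def forward(self, out_42, out_spatial_transformer):
        out_1 = F.relu(self.conv_descriptor_1(out_42))
        out_2 = torch.cat([out_1, out_spatial_transformer], dim=1)   # cascade superposition
        out_3 = F.relu(self.conv_descriptor_2(out_2))
        return F.normalize(out_3, p=2, dim=1)                        # 256-channel descriptor map
```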
In the invention, in the detector branch the concatenated feature map undergoes channel compression followed by feature position scoring, which determines the feature position within each 8×8 neighborhood; in the descriptor branch the concatenated feature map is compressed to 256-dimensional descriptors along the channel dimension, and descriptor interpolation is then performed with the 8×8 neighborhood center pixel coordinate as the position reference, improving the precision and specificity of the descriptors.
Further, as shown in fig. 3, the spatial module comprises a plurality of convolution networks, a grid generator, a sampling network and a sampler. The spatial module receives a shallow feature map from the backbone network as input; a six-degree-of-freedom affine transformation matrix is obtained through the convolution operations of the convolution networks and is fed to the grid generator, which generates a sampling grid; the sampler then samples the pixels of the shallow feature map of the backbone network according to the sampling grid to obtain the spatial conversion feature map.
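A minimal PyTorch sketch of such a spatial transformer, assuming a small localisation network whose layer sizes are not specified in the text and initialising the 6-DOF affine transform to the identity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialModule(nn.Module):
    """Spatial module of fig. 3: convolution networks regress a 6-DOF affine matrix,
    affine_grid acts as the grid generator, and grid_sample acts as the sampler."""
    def __init__(self, in_ch=64):
        super().__init__()
        self.loc = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, 6)
        # start from the identity transform so early training leaves the features unchanged
        self.fc.weight.data.zero_()
        self.fc.bias.data.copy_(torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, shallow_feat):
        theta = self.fc(self.loc(shallow_feat).flatten(1)).view(-1, 2, 3)   # 6-DOF affine matrix
        grid = F.affine_grid(theta, shallow_feat.size(), align_corners=False)  # sampling grid
        return F.grid_sample(shallow_feat, grid, align_corners=False)       # spatial conversion feature map
```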
Further, the training of the processing system is five-stage training, a synthetic data set containing basic geometric patterns is used as a training sample, the data enhancement operation is carried out on the training sample data set in the first stage, and then the detector branches are independently trained by using the training sample data set; a second stage, marking the real scene data set by utilizing the detector branches obtained by the training in the first stage, and obtaining a feature marking data set in the real scene; a third stage, namely completely emptying the weight parameters of the detector branches obtained by training in the first stage, and independently retraining the detector branches by using the feature annotation data set obtained in the second stage; a fourth stage, re-labeling the real scene data set by using the detector branch obtained in the third stage to obtain a secondary labeling data set; and fifthly, emptying the weights of the detector branches and the descriptor branches, and performing joint training on the detector branches and the descriptor branches by using the secondary annotation data set to finally obtain a stable processing system.
The processing system adopts a twin architecture overall and is structurally divided into a front-end backbone network and back-end detector and descriptor branches. The original image is input to the backbone network (Backbone), which performs image convolution processing on the original input image and outputs a deep feature map; this deep feature map is handed to the detector branch and the descriptor branch simultaneously as shared information for their different tasks. Meanwhile, the backbone network is connected to an external spatial processing module: a shallow feature map of the backbone network is separated out and passed to the spatial processing module, which acts as a spatial transformer (Spatial Transformer), processes the shallow feature map to obtain spatial information, and encodes this spatial information into the feature information to produce the spatial conversion feature map. The detector branch and the descriptor branch are twin modules used for feature position detection and feature description respectively. In this twin architecture, the deep feature map output by the backbone network is input to the detector branch and the descriptor branch and concatenated with the spatial conversion feature map; under the guidance of the loss functions, the weight parameters of the processing system are iteratively updated to obtain more accurate feature positions and descriptor information, and the output end of the processing system outputs a 65-channel corner probability map and a 256-channel descriptor map respectively.
In the invention, to prevent the subjective interference caused by manually annotated data, training is completed in a self-supervised manner. The training procedure has five stages in total. In the first stage, a synthetic data set containing basic geometric patterns (polygons, lines, stars, etc.) is generated autonomously by a program, as shown in figs. 4(A)-4(B), and data enhancement operations such as contrast adjustment, noise addition, motion blurring and brightness adjustment are applied to the data set, which is then used to preliminarily train the detector branch of the network alone. In the second stage, the detector branch obtained by the first-stage training is used to label a real scene data set (fig. 5) to obtain a feature annotation data set in real scenes. In the third stage, all weight parameters obtained in the first stage are cleared, and the real scene data set obtained in the second stage is used to retrain the detector branch alone. In the fourth stage, the detector obtained in the third stage is used to relabel the real scene data set; this secondary labeling further refines the quality of the data set and provides the basis for the final stage of training. In the fifth stage, the weights are cleared again and the high-quality data set obtained by the fourth-stage labeling is used to jointly train the whole network structure (detector and descriptor), finally yielding a complete and reliable feature processing network.
Aiming at the problems of unstable feature detection and poor descriptor repeatability that visual simultaneous localization and mapping (VSLAM) suffers under weak texture, the invention provides a twin feature processing network based on deep learning. A multi-layer convolutional neural network is arranged in the backbone to extract deep image features, while a spatial transformer module attached to the main architecture explicitly encodes spatial information and enhances the spatial stability and sensitivity of the feature information; the deep feature map and the spatial conversion feature map of the image are concatenated in the back-end branches to provide rich data for the network output layers. In the feature detector and descriptor branches, the feature map is divided with an 8×8 neighborhood as the basic unit, and the concatenated feature maps are compressed to 65 and 256 channels respectively; the detector branch uses a probability scoring strategy to determine the feature position within each 8×8 neighborhood (the value of the 65th channel represents the probability that no feature is present in the neighborhood), and the descriptor branch marks the feature information with 256-dimensional descriptors. To overcome the subjective errors caused by manually labeled features, the invention constructs the data labels in a self-supervised labeling manner, and the training process is finely divided into five stages to improve data quality and network accuracy, effectively relieving the dilemma of sparse data samples. With the visual feature processing system provided by the invention, multiple computer vision tasks such as visual localization, scene reconstruction and image stitching can run continuously and stably in weak texture environments, alleviating original defects such as missing features and algorithm collapse. The invention enhances functionality while preserving real-time performance to the greatest extent: at the detector level, the 8×8 neighborhood setting effectively controls the number of detected features, and at the descriptor branch, to further improve descriptor specificity without harming real-time performance, descriptor interpolation is computed with the 8×8 neighborhood center pixel coordinate as the position reference. Experimental tests show that, in typical weak texture scenes, the invention has higher practical application value and superior potential compared with existing mainstream feature algorithms.
The invention also provides a visual characteristic processing method based on the neural network in the weak texture environment, which is based on the processing system, and comprises the following steps:
step S1: acquiring an original image;
step S2: inputting the original image into the processing system;
step S3: the processing system performs feature detection and description on the original image to obtain corner points and corresponding descriptors of the original image;
step S4: based on the corner points and descriptors of the original image, image stitching, visual positioning and scene recognition can be completed in a weak texture environment.
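As a usage sketch for step S4, the corner points and 256-dimensional descriptors returned in step S3 could be matched by mutual nearest neighbours before homography or pose estimation; the matching strategy below is an assumption, not a procedure specified by the patent:

```python
import torch

def match_and_estimate(desc_a, desc_b, kpts_a, kpts_b):
    """Mutual-nearest-neighbour matching of 256-D unit descriptors.
    desc_a: (N, 256), desc_b: (M, 256); kpts_a: (N, 2), kpts_b: (M, 2) pixel coordinates.
    The returned matched point pairs can feed homography or pose estimation for
    image stitching, visual localization or scene recognition."""
    sim = desc_a @ desc_b.t()                     # cosine similarity (descriptors are unit length)
    nn_ab = sim.argmax(dim=1)                     # best match in B for each point in A
    nn_ba = sim.argmax(dim=0)                     # best match in A for each point in B
    mutual = nn_ba[nn_ab] == torch.arange(desc_a.shape[0])
    idx_a = torch.nonzero(mutual).squeeze(1)
    idx_b = nn_ab[idx_a]
    return kpts_a[idx_a], kpts_b[idx_b]           # matched point pairs
```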
The above specific embodiments merely describe the design principle of the present invention, and the shapes of the components in the description may be different, and the names are not limited. Therefore, the technical scheme described in the foregoing embodiments can be modified or replaced equivalently by those skilled in the art; such modifications and substitutions do not depart from the spirit and technical scope of the invention, and all of them should be considered to fall within the scope of the invention.

Claims (6)

1. A neural network-based visual feature processing system in a weak texture environment, the processing system comprising:
the system comprises a backbone network, a detector branch and a descriptor branch, wherein the detector branch and the descriptor branch are symmetrical sub-branches of a twin network;
the backbone network is used for receiving an input original image, carrying out convolution processing on the original image and outputting a deep feature map of the original image; the backbone network comprises a plurality of cascaded convolution layers, wherein shallow feature maps obtained after shallow convolution of the backbone network are simultaneously input into a first space module and a second space module; the first space module and the second space module are respectively used for space invariance reduction;
the detector branch comprises a plurality of cascaded convolution layers, the output of the first spatial module is fused with the output of the first convolution layer in the detector branch, and the detector branch outputs a corner probability map which is used for representing the probability that each point in the original image is a corner;
the descriptor branch comprises a plurality of cascaded convolution layers, the output of the second spatial module is fused with the output of the first convolution layer in the descriptor branch, and the descriptor branch outputs a descriptor map which is used for representing the descriptor of each point in the original image;
the detector branches are used for receiving deep feature images of the original images output by the backbone network;
the descriptor branch is used for receiving the deep feature map of the original image output by the backbone network.
2. The system of claim 1, wherein during training the detector branch uses an information-quantity loss function and divides the original image with an 8×8 neighborhood as the basic unit to obtain a basic-unit grid; assuming the grid contains H_C × W_C basic units in total, each basic unit denoted x_hw, and the ground-truth label set of the real scene data set denoted Y, the loss function of the detector branch is:

L_p(X, Y) = \frac{1}{H_C W_C} \sum_{h=1}^{H_C} \sum_{w=1}^{W_C} l_p(x_{hw}; y_{hw})

l_p(x_{hw}; y) = -\log\left( \frac{\exp(x_{hwy})}{\sum_{k=1}^{65} \exp(x_{hwk})} \right)

wherein H_C is the total number of rows of the basic-unit grid, W_C is the total number of columns of the basic-unit grid, h is the row index of the basic-unit grid, w is the column index of the basic-unit grid, y is the pixel position of the corner within one basic unit, l_p is the negative logarithm of the normalized network prediction at the corner pixel position, x_{hwy} is the network prediction at the corner pixel position within a basic unit, x_{hwk} is the network prediction at an arbitrary pixel position within a basic unit, and k is the channel index.
3. The system of claim 2, wherein the descriptor branch uses a hinge loss function during training, of the following form:
Descriptor map corresponding to the original image: D; homography transformation: H; descriptor map corresponding to the deformed image obtained by applying the homography transformation to the original image: D'.
Descriptor corresponding to the original image: d_{hw}; descriptor corresponding to the deformed image: d'_{h'w'}.
Center pixel coordinate of an 8×8 neighborhood within the original image: p_{hw}.
Center pixel coordinate of an 8×8 neighborhood within the deformed image: p'_{h'w'}.
The correspondence relation is judged as:

s_{hwh'w'} = \begin{cases} 1, & \text{if } \lVert \widehat{H p_{hw}} - p'_{h'w'} \rVert \le 8 \\ 0, & \text{otherwise} \end{cases}

l_d(d_{hw}, d'_{h'w'}, s) = \lambda_d \cdot s \cdot \max(0, m_p - d_{hw}^T d'_{h'w'}) + (1 - s) \cdot \max(0, d_{hw}^T d'_{h'w'} - m_n)

L_d(D, D') = \frac{1}{(H_C W_C)^2} \sum_{h=1}^{H_C} \sum_{w=1}^{W_C} \sum_{h'=1}^{H_C} \sum_{w'=1}^{W_C} l_d(d_{hw}, d'_{h'w'}, s_{hwh'w'})

wherein L_d is the descriptor loss function, \widehat{H p_{hw}} is the coordinate of the 8×8 neighborhood center pixel of the original image after homography transformation, λ_d is the weight parameter, s is the correspondence judgment parameter, m_p is the positive margin parameter, d_{hw}^T is the transpose of d_{hw}; h is the row index of the basic-unit grid corresponding to the original image, w is the column index of the basic-unit grid corresponding to the original image, h' is the row index of the basic-unit grid corresponding to the deformed image, w' is the column index of the basic-unit grid corresponding to the deformed image, and m_n is the negative margin parameter.
4. The system of claim 1, wherein the first spatial module and the second spatial module each comprise a plurality of convolutional networks, a grid generator, a sampling network, and a sampler; the space module receives a shallow feature map in a main network as input, a six-degree-of-freedom affine transformation matrix is obtained through convolution operation of the convolution networks, the obtained six-degree-of-freedom affine transformation matrix is input into a grid generator, the grid generator generates grids to obtain a sampling grid, and the sampler samples pixels of the shallow feature map in the main network according to the sampling grid to obtain a space conversion feature map.
5. The system of claim 1, wherein the training of the processing system is a five-stage training, a first stage of performing a data enhancement operation on a training sample dataset, followed by training the detector branches separately using the training sample dataset; a second stage, marking the real scene data set by utilizing the detector branches obtained by the training in the first stage, and obtaining a feature marking data set in the real scene; a third stage, namely completely emptying the weight parameters of the detector branches obtained by training in the first stage, and independently retraining the detector branches by using the feature annotation data set obtained in the second stage; a fourth stage, re-labeling the real scene data set by using the detector branch obtained in the third stage to obtain a secondary labeling data set; and a fifth stage, clearing the weight of the detector branch and the descriptor branch, and utilizing the secondary labeling data set to perform joint training on the detector branch and the descriptor branch.
6. A method for processing visual features based on a neural network in a weak texture environment, based on a processing system according to any one of claims 1-5, characterized in that the method comprises the steps of:
step S1: acquiring an original image;
step S2: inputting the original image into the processing system;
step S3: the processing system performs feature detection and description on the original image to obtain corner points and corresponding descriptors of the original image;
step S4: based on the corner points and descriptors of the original image, image stitching, visual positioning and scene recognition can be completed in a weak texture environment.
CN202210663043.2A 2022-06-07 2022-06-07 Visual characteristic processing system and method based on neural network in weak texture environment Active CN114937153B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210663043.2A CN114937153B (en) 2022-06-07 2022-06-07 Visual characteristic processing system and method based on neural network in weak texture environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210663043.2A CN114937153B (en) 2022-06-07 2022-06-07 Visual characteristic processing system and method based on neural network in weak texture environment

Publications (2)

Publication Number Publication Date
CN114937153A CN114937153A (en) 2022-08-23
CN114937153B true CN114937153B (en) 2023-06-30

Family

ID=82867108

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210663043.2A Active CN114937153B (en) 2022-06-07 2022-06-07 Visual characteristic processing system and method based on neural network in weak texture environment

Country Status (1)

Country Link
CN (1) CN114937153B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117710467A (en) * 2024-02-06 2024-03-15 天津云圣智能科技有限责任公司 Unmanned plane positioning method, unmanned plane positioning equipment and aircraft

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861988A (en) * 2021-03-04 2021-05-28 西南科技大学 Feature matching method based on attention-seeking neural network
CN113066129A (en) * 2021-04-12 2021-07-02 北京理工大学 Visual positioning and mapping system based on target detection in dynamic environment
CN113610905A (en) * 2021-08-02 2021-11-05 北京航空航天大学 Deep learning remote sensing image registration method based on subimage matching and application

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111768432B (en) * 2020-06-30 2022-06-10 中国科学院自动化研究所 Moving target segmentation method and system based on twin deep neural network
CN113988269A (en) * 2021-11-05 2022-01-28 南通大学 Loop detection and optimization method based on improved twin network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861988A (en) * 2021-03-04 2021-05-28 西南科技大学 Feature matching method based on attention-seeking neural network
CN113066129A (en) * 2021-04-12 2021-07-02 北京理工大学 Visual positioning and mapping system based on target detection in dynamic environment
CN113610905A (en) * 2021-08-02 2021-11-05 北京航空航天大学 Deep learning remote sensing image registration method based on subimage matching and application

Also Published As

Publication number Publication date
CN114937153A (en) 2022-08-23

Similar Documents

Publication Publication Date Title
CN114255238A (en) Three-dimensional point cloud scene segmentation method and system fusing image features
CN107844743A (en) A kind of image multi-subtitle automatic generation method based on multiple dimensioned layering residual error network
CN110796018A (en) Hand motion recognition method based on depth image and color image
CN113177560A (en) Universal lightweight deep learning vehicle detection method
CN113449691A (en) Human shape recognition system and method based on non-local attention mechanism
CN112819080B (en) High-precision universal three-dimensional point cloud identification method
CN114937153B (en) Visual characteristic processing system and method based on neural network in weak texture environment
CN114863539A (en) Portrait key point detection method and system based on feature fusion
CN111488856B (en) Multimodal 2D and 3D facial expression recognition method based on orthogonal guide learning
CN116258990A (en) Cross-modal affinity-based small sample reference video target segmentation method
Wang et al. Accurate real-time ship target detection using Yolov4
Sun et al. IRDCLNet: Instance segmentation of ship images based on interference reduction and dynamic contour learning in foggy scenes
CN111368733A (en) Three-dimensional hand posture estimation method based on label distribution learning, storage medium and terminal
CN113901928A (en) Target detection method based on dynamic super-resolution, and power transmission line component detection method and system
CN113489958A (en) Dynamic gesture recognition method and system based on video coding data multi-feature fusion
CN113505808A (en) Detection and identification algorithm for power distribution facility switch based on deep learning
CN111860361A (en) Green channel cargo scanning image entrainment automatic identifier and identification method
CN117011380A (en) 6D pose estimation method of target object
CN114494284B (en) Scene analysis model and method based on explicit supervision area relation
CN113486718B (en) Fingertip detection method based on deep multitask learning
CN115393735A (en) Remote sensing image building extraction method based on improved U-Net
Si et al. Image semantic segmentation based on improved DeepLab V3 model
Fu et al. Complementarity-aware Local-global Feature Fusion Network for Building Extraction in Remote Sensing Images
CN113538474A (en) 3D point cloud segmentation target detection system based on edge feature fusion
CN117274723B (en) Target identification method, system, medium and equipment for power transmission inspection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant